
AFM-4.5B

AFM-4.5B is the first model in the Arcee Foundation Model family. It is a 4.5-billion-parameter small language model that delivers business performance comparable to much larger models at vastly lower hosting costs, while remaining efficient enough to run on low-RAM GPUs or even CPUs.

AFM-4.5B comes in two variants: base and instruct. The base model was trained on a dataset of 8 trillion tokens, comprising 6.5 trillion tokens of general pre-training data followed by 1.5 trillion tokens of mid-training data with an increased focus on mathematical reasoning and code generation. Following pre-training, the model underwent supervised fine-tuning on high-quality instruction datasets. The instruction-tuned model was then further refined with reinforcement learning, using both verifiable rewards and human preference optimization.

We used a modified version of TorchTitan for pre-training, Axolotl for supervised fine-tuning, and a modified version of Verifiers for reinforcement learning.

Both variants of AFM-4.5B are available on Hugging Face:

arcee-ai/AFM-4.5B

arcee-ai/AFM-4.5B-Base
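
If you want to try the instruct checkpoint directly, a minimal sketch along the following lines should work, assuming a recent transformers release that includes support for the AFM architecture (the prompt and generation length here are illustrative only):

```python
# Minimal sketch: load the instruct variant from Hugging Face and run one chat turn.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/AFM-4.5B"  # use "arcee-ai/AFM-4.5B-Base" for the base model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16, matching the GPU benchmarks below
    device_map="auto",           # requires accelerate; falls back to CPU without a GPU
)

messages = [{"role": "user", "content": "Summarize what AFM-4.5B is in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```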

Get Started

To get started deploying AFM-4.5B on a GPU, proceed to AFM-4.5B GPU Quick Deploy.

To get started deploying AFM-4.5B on a CPU, proceed to AFM-4.5B CPU Quick Deploy.

Model Summary

  • Name: AFM-4.5B

  • Parameters: 4.5 billion

  • Architecture: Decoder-only Transformer

  • Activation Function: ReLU²

  • Attention: Grouped Query Attention

  • Training Tokens: 8 trillion

Recommended Inference Parameters

  • temperature: 0.5

  • top_k: 50

  • top_p: 0.95

  • repeat_penalty: 1.1
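
These names follow llama.cpp conventions; in transformers, repeat_penalty corresponds to repetition_penalty. As a hedged sketch, the settings can be expressed as a GenerationConfig and reused with the model loaded in the Hugging Face example above:

```python
# Sketch: the recommended sampling parameters as a transformers GenerationConfig.
from transformers import GenerationConfig

afm_generation_config = GenerationConfig(
    do_sample=True,          # sampling must be enabled for temperature/top_k/top_p to apply
    temperature=0.5,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,  # equivalent of repeat_penalty
)

# Usage (model and tokenized inputs from the loading sketch above):
# model.generate(inputs, generation_config=afm_generation_config, max_new_tokens=256)
```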

Training Pipeline

  • Pre-training (6.5T tokens): General web, code, multilingual, and reasoning data.

  • Mid-training (1.5T tokens): Emphasis on math, programming, and structured reasoning.

  • Supervised Fine-tuning: High-quality instruction datasets for chat-style interactions.

  • Reinforcement Learning: Verifiable rewards and human preference optimization.

  • Data Curation: Powered by DatologyAI, using model-based filtering, source mixing, and synthetic data generation.

Performance Characteristics

  • Factual Accuracy: Low hallucination rate, thanks to a clean, curated training dataset.

  • Compliance: Minimal IP risk, with copyrighted books and restricted data excluded from training.

  • Inference Efficiency: Suitable for real-time applications on lower-end GPUs or CPUs.

  • Multilingual: Supports Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.

Performance Metrics

Hardware    Max Model Len   Quant   Max Concurrent Requests   TPS per Request*
H100 x 1    65536 (Max)     bf16    16                        136
H100 x 1    4096            bf16    250                       74.5
L40S x 1    8190            bf16    58                        137
L40S x 1    4096            bf16    115                       136
Intel CPU¹  1024            Q4_0    4                         29
Graviton4²  1024            Q4_0    4                         60

¹ Intel Sapphire Rapids CPU with 32 threads

² AWS Graviton4 instance with 32 vCPUs

*TPS benchmarks represent tokens per second per request at the maximum number of concurrent requests. TPS increases with fewer concurrent requests, so the benchmark numbers effectively represent a minimum TPS.
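
To give a concrete sense of how the GPU rows translate into a serving setup, here is a minimal Python sketch using vLLM. It is not the harness used to produce these benchmarks; the context length and concurrency limit are simply copied from the H100 x 1 / 4096 / bf16 row, and the sampling values follow the recommended inference parameters above.

```python
# Sketch: run AFM-4.5B with vLLM in bf16 at a 4096-token context,
# roughly mirroring the H100 x 1 / 4096 / bf16 configuration above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="arcee-ai/AFM-4.5B",
    dtype="bfloat16",
    max_model_len=4096,
    max_num_seqs=250,  # upper bound on concurrent requests
)

params = SamplingParams(temperature=0.5, top_k=50, top_p=0.95, repetition_penalty=1.1)
print(llm.generate(["Explain grouped query attention briefly."], params)[0].outputs[0].text)
```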

Relevant Blogs

Announcing Arcee Foundation Models

Deep Dive: AFM-4.5B, the First Arcee Foundation Model

Is Running Language Models on CPU Really Viable?
