# AFM-4.5B

**AFM-4.5B** is the first model available in the Arcee Foundation Model family. AFM-4.5B is a 4.5-billion-parameter small language model that delivers enterprise performance comparable to much larger models at vastly lower hosting costs, while being efficient enough to run on low-RAM GPUs or even CPUs.

AFM-4.5B comes in two variants: base and instruct. The base model was trained on a dataset of 8 trillion tokens, comprising 6.5 trillion tokens of general pre-training data followed by 1.5 trillion tokens of mid-training data with an increased focus on mathematical reasoning and code generation. Following pre-training, the model underwent supervised fine-tuning on high-quality instruction datasets. The instruction-tuned model was further refined with reinforcement learning, both against verifiable rewards and for human preference alignment.

We used a modified version of [TorchTitan](https://arxiv.org/abs/2410.06511) for pre-training, [Axolotl](https://axolotl.ai/) for supervised fine-tuning, and a modified version of [Verifiers](https://github.com/willccbb/verifiers) for reinforcement learning.

Both variants of AFM-4.5B are available on Hugging Face:

[arcee-ai/AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B)

[arcee-ai/AFM-4.5B-Base](https://huggingface.co/arcee-ai/AFM-4.5B-Base)
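
As a quick sanity check, the base checkpoint can be loaded like any standard causal language model on the Hub. The snippet below is a minimal sketch assuming the `transformers` library with `torch` installed and enough memory for the bf16 weights; the prompt is illustrative only.

```python
import torch
from transformers import pipeline

# Minimal completion sketch for the base checkpoint (assumes a recent
# transformers release and enough GPU/CPU memory for the bf16 weights).
generator = pipeline(
    "text-generation",
    model="arcee-ai/AFM-4.5B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

print(generator("Small language models are useful because", max_new_tokens=64)[0]["generated_text"])
```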

### Deployment Quickstart

To get started deploying AFM-4.5B, proceed to [AFM-4.5B Quick Deploys](https://docs.arcee.ai/language-models/broken-reference).

### Model Summary

|                                  |                                                                                                         |
| -------------------------------- | ------------------------------------------------------------------------------------------------------- |
| Name                             | `AFM-4.5B`                                                                                              |
| Parameters                       | 4.5 billion                                                                                             |
| Architecture                     | Decoder-only Transformer                                                                                |
| Activation Function              | ReLU²                                                                                                   |
| Attention                        | Grouped Query Attention                                                                                 |
| Training Tokens                  | [8 trillion](https://blog.datologyai.com/beyondweb/)\*                                                  |
| License                          | Apache 2.0                                                                                              |
| Recommended Inference Parameters | <ul><li>temperature: 0.5</li><li>top\_k: 50</li><li>top\_p: 0.95</li><li>repeat\_penalty: 1.1</li></ul> |

{% hint style="info" %}
The blog post linked in the Training Tokens row details the dataset curation process carried out by Arcee AI and DatologyAI.
{% endhint %}
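
As a usage illustration, the sketch below applies the recommended inference parameters from the table above to the instruct checkpoint with `transformers`. It assumes the model card ships a chat template and that the standard `AutoModelForCausalLM`/`AutoTokenizer` classes load the checkpoint; the prompt and `max_new_tokens` value are placeholders to adapt to your use case.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/AFM-4.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain grouped query attention in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended inference parameters from the Model Summary table.
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=0.5,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```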

### Training Pipeline

* **Pre-training (6.5T tokens)**: General web, code, multilingual, and reasoning data.
* **Mid-training (1.5T tokens)**: Emphasis on **math**, **programming**, and **structured reasoning**.
* **Supervised Fine-tuning**: High-quality instruction datasets for chat-style interactions.
* **Reinforcement Learning**: Optimization against verifiable rewards as well as human preferences.
* **Data Curation**: Powered by **DatologyAI**, using model-based filtering, source mixing, and synthetic data generation.

### Performance Characteristics

* **Factual Accuracy**: Low hallucination rate due to a clean, curated training dataset.
* **Compliance**: Minimal IP risk through the exclusion of copyrighted books and restricted data.
* **Inference Efficiency**: Suitable for real-time applications on lower-end GPUs or CPUs.
* **Multilingual**: Supports Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.

### **Performance Metrics**

<table><thead><tr><th width="144.05078125">Hardware</th><th width="146.98828125">Max Model Len</th><th width="94.31640625">Quantization</th><th width="121.16796875">Max Concurrent Requests</th><th width="155.06640625">TPS per Request*</th></tr></thead><tbody><tr><td>H100 x 1</td><td>65536 (Max)</td><td>bf16</td><td>16</td><td>136</td></tr><tr><td>H100 x 1</td><td>4096</td><td>bf16</td><td>250</td><td>74.5</td></tr><tr><td>L40S x 1</td><td>8192</td><td>bf16</td><td>55</td><td>59</td></tr><tr><td>L40S x 1</td><td>4096</td><td>bf16</td><td>109</td><td>64</td></tr><tr><td>A10 x 1</td><td>8192</td><td>bf16</td><td>12</td><td>65</td></tr><tr><td>A10 x 1</td><td>4096</td><td>bf16</td><td>25</td><td>75</td></tr><tr><td>Intel CPU<sup>1</sup></td><td>1024</td><td>Q4_0</td><td>4</td><td>29</td></tr><tr><td>Graviton4<sup>2</sup></td><td>1024</td><td>Q4_0</td><td>4</td><td>60</td></tr></tbody></table>

<sup>1</sup> Intel Sapphire Rapids CPU with 32 threads

<sup>2</sup> AWS Graviton4 instance with 32 vCPUs

{% hint style="info" %}
TPS benchmarks represent tokens per second per request at maximum concurrent requests. TPS will increase with fewer concurrent requests, so the benchmark numbers effectively represent minimum TPS.
{% endhint %}
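
The CPU rows above correspond to a 4-bit (Q4_0) quantized build of the model running with 32 threads and a 1,024-token context. As a rough illustration of that setup, here is a sketch using `llama-cpp-python`; the GGUF filename is a placeholder assumption, and the thread count and context length simply mirror the benchmark configuration.

```python
from llama_cpp import Llama

# Placeholder path: assumes a Q4_0 GGUF conversion of AFM-4.5B is available locally.
llm = Llama(
    model_path="AFM-4.5B-Q4_0.gguf",
    n_ctx=1024,     # context length used in the CPU benchmarks above
    n_threads=32,   # 32 threads / vCPUs, matching the benchmark hosts
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three tips for running language models on CPUs."}],
    temperature=0.5,
    top_k=50,
    top_p=0.95,
    repeat_penalty=1.1,
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```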

### Relevant Blogs

[Announcing Arcee Foundation Models](https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family)

[Deep Dive: AFM-4.5B, the First Arcee Foundation Model](https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model)

[Is Running Language Models on CPU Really Viable?](https://www.arcee.ai/blog/is-running-language-models-on-cpu-really-viable)
