# AFM-4.5B

**AFM-4.5B** is the first model available in the Arcee Foundation Model family. AFM-4.5B is a 4.5-billion-parameter small language model that delivers enterprise performance comparable to much larger models at vastly lower hosting costs, while being efficient enough to run on low-RAM GPUs or even CPUs.

AFM-4.5B comes in two variants: base and instruct. The base model was trained on a dataset of 8 trillion tokens, comprising 6.5 trillion tokens of general pre-training data followed by 1.5 trillion tokens of mid-training data with an increased focus on mathematical reasoning and code generation. Following pre-training, the model underwent supervised fine-tuning on high-quality instruction datasets. The instruction-tuned model was further refined with reinforcement learning, both against verifiable rewards and for human preference alignment.

We used a modified version of [TorchTitan](https://arxiv.org/abs/2410.06511) for pre-training, [Axolotl](https://axolotl.ai/) for supervised fine-tuning, and a modified version of [Verifiers](https://github.com/willccbb/verifiers) for reinforcement learning.

Both variants of AFM-4.5B are available on Hugging Face:

[arcee-ai/AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B)

[arcee-ai/AFM-4.5B-Base](https://huggingface.co/arcee-ai/AFM-4.5B-Base)
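
As a quick sanity check, the base checkpoint can be loaded like any standard causal language model on the Hub. The snippet below is a minimal sketch assuming the `transformers` library with `torch` installed and enough memory for the bf16 weights; the prompt is illustrative only.

```python
import torch
from transformers import pipeline

# Minimal completion sketch for the base checkpoint (assumes a recent
# transformers release and enough GPU/CPU memory for the bf16 weights).
generator = pipeline(
    "text-generation",
    model="arcee-ai/AFM-4.5B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

print(generator("Small language models are useful because", max_new_tokens=64)[0]["generated_text"])
```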

### Deployment Quickstart

To get started deploying AFM-4.5B, proceed to [AFM-4.5B Quick Deploys](https://docs.arcee.ai/language-models/broken-reference).

### Model Summary

|                                  |                                                                                                         |
| -------------------------------- | ------------------------------------------------------------------------------------------------------- |
| Name                             | `AFM-4.5B`                                                                                              |
| Parameters                       | 4.5 billion                                                                                             |
| Architecture                     | Decoder-only Transformer                                                                                |
| Activation Function              | ReLU²                                                                                                   |
| Attention                        | Grouped Query Attention                                                                                 |
| Training Tokens                  | [8 trillion](https://blog.datologyai.com/beyondweb/)\*                                                  |
| License                          | Apache 2.0                                                                                              |
| Recommended Inference Parameters | <ul><li>temperature: 0.5</li><li>top\_k: 50</li><li>top\_p: 0.95</li><li>repeat\_penalty: 1.1</li></ul> |

{% hint style="info" %}
The blog post linked in the Training Tokens row details the dataset curation process carried out by Arcee AI and DatologyAI.
{% endhint %}
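
As a usage illustration, the sketch below applies the recommended inference parameters from the table above to the instruct checkpoint with `transformers`. It assumes the model card ships a chat template and that the standard `AutoModelForCausalLM`/`AutoTokenizer` classes load the checkpoint; the prompt and `max_new_tokens` value are placeholders to adapt to your use case.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/AFM-4.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain grouped query attention in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended inference parameters from the Model Summary table.
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=0.5,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```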

### Training Pipeline

* **Pre-training (6.5T tokens)**: General web, code, multilingual, and reasoning data.
* **Mid-training (1.5T tokens)**: Emphasis on **math**, **programming**, and **structured reasoning**.
* **Supervised Fine-tuning**: High-quality instruction datasets for chat-style interactions.
* **Reinforcement Learning**: Optimization against verifiable rewards as well as human preferences.
* **Data Curation**: Powered by **DatologyAI**, using model-based filtering, source mixing, and synthetic data generation.

### Performance Characteristics

* **Factual Accuracy**: Low hallucination rate due to a clean, curated training dataset.
* **Compliance**: Minimal IP risk through the exclusion of copyrighted books and restricted data.
* **Inference Efficiency**: Suitable for real-time applications on lower-end GPUs or CPUs.
* **Multilingual**: Supports Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.

### **Performance Metrics**

<table><thead><tr><th width="144.05078125">Hardware</th><th width="146.98828125">Max Model Len</th><th width="94.31640625">Quantization</th><th width="121.16796875">Max Concurrent Requests</th><th width="155.06640625">TPS per Request*</th></tr></thead><tbody><tr><td>H100 x 1</td><td>65536 (Max)</td><td>bf16</td><td>16</td><td>136</td></tr><tr><td>H100 x 1</td><td>4096</td><td>bf16</td><td>250</td><td>74.5</td></tr><tr><td>L40S x 1</td><td>8192</td><td>bf16</td><td>55</td><td>59</td></tr><tr><td>L40S x 1</td><td>4096</td><td>bf16</td><td>109</td><td>64</td></tr><tr><td>A10 x 1</td><td>8192</td><td>bf16</td><td>12</td><td>65</td></tr><tr><td>A10 x 1</td><td>4096</td><td>bf16</td><td>25</td><td>75</td></tr><tr><td>Intel CPU<sup>1</sup></td><td>1024</td><td>Q4_0</td><td>4</td><td>29</td></tr><tr><td>Graviton4<sup>2</sup></td><td>1024</td><td>Q4_0</td><td>4</td><td>60</td></tr></tbody></table>

<sup>1</sup> Intel Sapphire Rapids CPU with 32 threads

<sup>2</sup> AWS Graviton4 instance with 32 vCPUs

{% hint style="info" %}
TPS benchmarks represent tokens per second per request at maximum concurrent requests. TPS will increase with fewer concurrent requests, so the benchmark numbers effectively represent minimum TPS.
{% endhint %}
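
The CPU rows above correspond to a 4-bit (Q4_0) quantized build of the model running with 32 threads and a 1,024-token context. As a rough illustration of that setup, here is a sketch using `llama-cpp-python`; the GGUF filename is a placeholder assumption, and the thread count and context length simply mirror the benchmark configuration.

```python
from llama_cpp import Llama

# Placeholder path: assumes a Q4_0 GGUF conversion of AFM-4.5B is available locally.
llm = Llama(
    model_path="AFM-4.5B-Q4_0.gguf",
    n_ctx=1024,     # context length used in the CPU benchmarks above
    n_threads=32,   # 32 threads / vCPUs, matching the benchmark hosts
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three tips for running language models on CPUs."}],
    temperature=0.5,
    top_k=50,
    top_p=0.95,
    repeat_penalty=1.1,
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```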

### Relevant Blogs

[Announcing Arcee Foundation Models](https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family)

[Deep Dive: AFM-4.5B, the First Arcee Foundation Model](https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model)

[Is Running Language Models on CPU Really Viable?](https://www.arcee.ai/blog/is-running-language-models-on-cpu-really-viable)
