# llama.cpp

## llama.cpp Benchmarks

The `llama.cpp` benchmark suite was run on **Trinity Nano and Trinity Mini** across the same three GPUs: RTX 3090, RTX 4090, and RTX 5090.

Benchmarks include:

* decode speed tests (`tg128`: generating 128 tokens)
* quantization sweeps
* context scaling
* real generation workloads
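
A decode test like `tg128` is typically run with llama.cpp's `llama-bench` tool. The sketch below only assembles such a command line and does not run it. The model path is hypothetical, `-ngl 99` is an assumed "offload all layers" setting, and the exact options used for the results on this page are not stated here.

```python
import shlex

def decode_bench_cmd(model_path, n_gen=128, n_gpu_layers=99):
    """Build a llama-bench command that measures decode speed only."""
    return [
        "llama-bench",
        "-m", model_path,            # GGUF model file (hypothetical path below)
        "-p", "0",                   # skip the prompt-processing test
        "-n", str(n_gen),            # tokens to generate; 128 gives the tg128 test
        "-ngl", str(n_gpu_layers),   # layers to offload to the GPU (assumed value)
    ]

cmd = decode_bench_cmd("models/trinity-nano-Q4_K_M.gguf")
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` for a quantization sweep, one model file per call.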

### RTX 3090

#### Nano decode

| Quantization | tg128           |
| ------------ | --------------- |
| Q4\_K\_M     | \~184–186 tok/s |
| bf16         | \~150 tok/s     |

#### Mini decode

| Quantization | tg128           |
| ------------ | --------------- |
| Q2\_K        | \~180–181 tok/s |
| Q4\_K\_M     | \~179–180 tok/s |
| Q5\_K\_M     | \~173 tok/s     |
| Q6\_K        | \~156–158 tok/s |

### RTX 4090

#### Nano decode

| Quantization | tg128         |
| ------------ | ------------- |
| Q4\_K\_M     | \~242.6 tok/s |
| bf16         | \~189.1 tok/s |

#### Mini decode

| Quantization | tg128               |
| ------------ | ------------------- |
| Q2\_K        | \~255.7–255.8 tok/s |
| Q4\_K\_M     | \~229–230 tok/s     |
| Q5\_K\_M     | \~216 tok/s         |
| Q6\_K        | \~202 tok/s         |

### RTX 5090

#### Nano decode

| Quantization | tg128           |
| ------------ | --------------- |
| Q2\_K        | \~197–205 tok/s |
| Q4\_K\_M     | \~199 tok/s     |
| Q8\_0        | \~209 tok/s     |
| bf16         | \~155–156 tok/s |

#### Mini decode

| Quantization | tg128           |
| ------------ | --------------- |
| Q2\_K        | \~237 tok/s     |
| Q4\_K\_M     | \~231–248 tok/s |
| Q5\_K\_M     | \~225 tok/s     |
| Q6\_K        | \~223–229 tok/s |

### Context scaling (RTX 5090)

| Model         | ctx 512       | ctx 32768    |
| ------------- | ------------- | ------------ |
| Nano Q4\_K\_M | \~12.6k tok/s | \~8.4k tok/s |
| Mini Q4\_K\_M | \~8.3k tok/s  | \~4.7k tok/s |
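
To compare how well each model holds up at long context, the table values can be reduced to a retention ratio (throughput at ctx 32768 divided by throughput at ctx 512). This is a minimal sketch using the rounded figures from the table, so the percentages are approximate.

```python
# Throughput in tok/s, copied from the context-scaling table (rounded values).
results = {
    "Nano Q4_K_M": {512: 12_600, 32_768: 8_400},
    "Mini Q4_K_M": {512: 8_300, 32_768: 4_700},
}

def retention(by_ctx, short=512, long=32_768):
    """Fraction of short-context throughput that remains at long context."""
    return by_ctx[long] / by_ctx[short]

for model, by_ctx in results.items():
    print(f"{model}: {retention(by_ctx):.0%} of ctx-512 throughput at ctx 32768")
# Nano Q4_K_M: 67% of ctx-512 throughput at ctx 32768
# Mini Q4_K_M: 57% of ctx-512 throughput at ctx 32768
```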

### Model compatibility

| Model              | Size      | RTX 3090      | RTX 4090      | RTX 5090      |
| ------------------ | --------- | ------------- | ------------- | ------------- |
| Trinity Mini Q8\_0 | \~27.8 GB | Not supported | Not supported | Supported     |
| Trinity Mini bf16  | \~52.3 GB | Not supported | Not supported | Not supported |
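
The pattern in this table is consistent with a simple VRAM capacity check: the RTX 3090 and RTX 4090 have 24 GB of VRAM and the RTX 5090 has 32 GB. The sketch below assumes that "Supported" means the model fits entirely in VRAM. The `headroom_gb` parameter is a hypothetical allowance for KV cache and runtime buffers, not a measured value.

```python
# VRAM per card in GB, and model sizes from the table above.
VRAM_GB = {"RTX 3090": 24, "RTX 4090": 24, "RTX 5090": 32}
MODEL_GB = {"Trinity Mini Q8_0": 27.8, "Trinity Mini bf16": 52.3}

def fits(model_gb, vram_gb, headroom_gb=1.0):
    """True if the model plus an assumed headroom fits in VRAM."""
    return model_gb + headroom_gb <= vram_gb

for model, size in MODEL_GB.items():
    row = {gpu: fits(size, vram) for gpu, vram in VRAM_GB.items()}
    print(model, row)
```

Under these assumptions only Trinity Mini Q8\_0 on the RTX 5090 passes the check, matching the table.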
