
llama.cpp Benchmarks

The llama.cpp benchmark suite was run against both Trinity Nano and Trinity Mini on the same set of GPUs (RTX 3090, RTX 4090, RTX 5090).

Benchmarks include:

  • decode speed tests

  • quantization sweeps

  • context scaling

  • real generation workloads
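Throughout these tables, tg128 is llama-bench's standard text-generation test: decode 128 new tokens and report the average throughput in tokens per second. The metric itself is just tokens over wall-clock time; a minimal sketch (the 0.69 s timing below is hypothetical, chosen only to land near the RTX 3090 figures reported further down):

```python
# tg128-style throughput: tokens generated divided by wall-clock seconds.
# The elapsed time used here is hypothetical, for illustration only.
def decode_throughput(n_tokens: int, elapsed_s: float) -> float:
    """Average decode speed in tokens per second."""
    return n_tokens / elapsed_s

print(f"{decode_throughput(128, 0.69):.1f} tok/s")  # ≈185.5 tok/s
```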

RTX 3090

Nano decode

| Quantization | tg128 |
|---|---|
| Q4_K_M | ~184–186 tok/s |
| bf16 | ~150 tok/s |

Mini decode

| Quantization | tg128 |
|---|---|
| Q2_K | ~180–181 tok/s |
| Q4_K_M | ~179–180 tok/s |
| Q5_K_M | ~173 tok/s |
| Q6_K | ~156–158 tok/s |
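One way to read the Mini table: normalize each quantization's throughput to the slowest entry, Q6_K. A quick sketch using the midpoints of the tg128 ranges above:

```python
# Relative decode speed across Mini quantizations on the RTX 3090,
# using midpoints of the tg128 ranges reported in the table above.
tg128 = {
    "Q2_K": 180.5,    # ~180–181 tok/s
    "Q4_K_M": 179.5,  # ~179–180 tok/s
    "Q5_K_M": 173.0,
    "Q6_K": 157.0,    # ~156–158 tok/s
}
baseline = tg128["Q6_K"]
for quant, ts in tg128.items():
    print(f"{quant}: {ts / baseline:.2f}x vs Q6_K")
```

By this measure Q2_K and Q4_K_M are nearly tied (~1.15x and ~1.14x), so on this card the speed argument for dropping below Q4_K_M is weak.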

RTX 4090

Nano decode

| Quantization | tg128 |
|---|---|
| Q4_K_M | ~242.6 tok/s |
| bf16 | ~189.1 tok/s |

Mini decode

| Quantization | tg128 |
|---|---|
| Q2_K | ~255.7–255.8 tok/s |
| Q4_K_M | ~229–230 tok/s |
| Q5_K_M | ~216 tok/s |
| Q6_K | ~202 tok/s |

RTX 5090

Nano decode

| Quantization | tg128 |
|---|---|
| Q2_K | ~197–205 tok/s |
| Q4_K_M | ~199 tok/s |
| Q8_0 | ~209 tok/s |
| bf16 | ~155–156 tok/s |

Mini decode

| Quantization | tg128 |
|---|---|
| Q2_K | ~237 tok/s |
| Q4_K_M | ~231–248 tok/s |
| Q5_K_M | ~225 tok/s |
| Q6_K | ~223–229 tok/s |

Context scaling (RTX 5090)

| Model | ctx 512 | ctx 32768 |
|---|---|---|
| Nano Q4_K_M | ~12.6k tok/s | ~8.4k tok/s |
| Mini Q4_K_M | ~8.3k tok/s | ~4.7k tok/s |
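The interesting figure here is retention: how much throughput survives when the context grows 64x. A quick computation from the table's numbers (k = thousand tok/s):

```python
# Throughput retention when growing context from 512 to 32768 tokens,
# using the RTX 5090 context-scaling numbers from the table above.
scaling = {
    "Nano Q4_K_M": (12.6, 8.4),  # (ctx 512, ctx 32768), in k tok/s
    "Mini Q4_K_M": (8.3, 4.7),
}
for model, (short_ctx, long_ctx) in scaling.items():
    print(f"{model}: retains {long_ctx / short_ctx:.0%} at ctx 32768")
```

Nano retains roughly two thirds of its short-context speed, while Mini drops to a bit over half, consistent with its larger per-token attention cost at long context.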

Model compatibility

| Model | Size | RTX 3090 | RTX 4090 | RTX 5090 |
|---|---|---|---|---|
| Trinity Mini Q8_0 | ~27.8 GB | Not supported | Not supported | Supported |
| Trinity Mini bf16 | ~52.3 GB | Not supported | Not supported | Not supported |
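These rows follow from simple VRAM arithmetic: the weights must fit in the card's memory with headroom left over for the KV cache and activations. A rough sketch — the 24 GB / 32 GB capacities are the cards' published specs, but the 2 GB headroom figure is an illustrative assumption, not a measured value:

```python
def fits(model_gb: float, vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """Rough check: do the weights fit, leaving headroom for KV cache
    and activations? The 2 GB default headroom is an assumption."""
    return model_gb + headroom_gb <= vram_gb

# Published VRAM capacities for the three cards benchmarked above.
gpus = {"RTX 3090": 24, "RTX 4090": 24, "RTX 5090": 32}
for name, vram in gpus.items():
    print(f"{name}: Q8_0 fits={fits(27.8, vram)}, bf16 fits={fits(52.3, vram)}")
```

Under these assumptions the check reproduces the table: Mini Q8_0 (~27.8 GB) only fits the 32 GB RTX 5090, and Mini bf16 (~52.3 GB) fits none of the three cards.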
