vLLM Benchmarks

vLLM benchmarks were run using Trinity Nano. Trinity Mini in bf16 (~52.3 GB) does not fit in the VRAM of any of the single consumer GPUs used in these tests (RTX 3090 and RTX 4090 with 24 GB each, RTX 5090 with 32 GB), so Mini results are not included here.

Test configuration

  • Input tokens: 512

  • Output tokens: 256

  • Prompts: 512

  • Concurrency: 8

  • Request rate: 8 rps
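The totals implied by this configuration can be worked out directly. The sketch below is illustrative only: the variable names are hypothetical, and it assumes every prompt uses exactly 512 input and 256 output tokens and that requests arrive at a steady 8 per second.

```python
# Workload totals implied by the test configuration above.
# Assumption: every prompt uses exactly the stated token counts.
INPUT_TOKENS = 512       # per request
OUTPUT_TOKENS = 256      # per request
NUM_PROMPTS = 512
CONCURRENCY = 8
REQUEST_RATE_RPS = 8

total_input_tokens = INPUT_TOKENS * NUM_PROMPTS
total_output_tokens = OUTPUT_TOKENS * NUM_PROMPTS

# Time needed just to submit all prompts at the configured request rate.
submit_window_s = NUM_PROMPTS / REQUEST_RATE_RPS

# Upper bound on tokens held in context across all concurrent requests.
max_tokens_in_flight = CONCURRENCY * (INPUT_TOKENS + OUTPUT_TOKENS)

print(f"total input tokens:   {total_input_tokens}")
print(f"total output tokens:  {total_output_tokens}")
print(f"submit window:        {submit_window_s:.0f} s")
print(f"max tokens in flight: {max_tokens_in_flight}")
```

A run takes longer than the submit window whenever the server completes fewer than 8 requests per second, because the concurrency cap then holds back new requests.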


RTX 3090

Performance

| Precision | Req/s | Output tok/s | Mean TTFT | p99 TTFT | TPOT / ITL | VRAM Used |
| --- | --- | --- | --- | --- | --- | --- |
| bf16 | 2.87 | 735.67 | 47.32 ms | 58.05 ms | 10.64 ms | 23556 MiB |
| W4A16 | 3.97 | 1016.35 | 40.69 ms | 51.35 ms | 7.66 ms | 23594 MiB |
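One way to read the RTX 3090 rows: Req/s sits well below the configured 8 rps, which suggests (as an assumption, not something the benchmark states) that the concurrency cap of 8 is the limiting factor. Under that assumption, Little's law gives Req/s ≈ concurrency / per-request latency. The sketch below approximates latency as mean TTFT plus one TPOT for each remaining output token; the function name and the latency model are illustrative and not part of any benchmark tooling.

```python
def estimate_req_per_s(ttft_ms, tpot_ms, output_tokens=256, concurrency=8):
    """Little's-law estimate of request throughput under a concurrency cap.

    Assumes latency = TTFT + (output_tokens - 1) * TPOT, i.e. the first
    token costs TTFT and every later token costs one TPOT.
    """
    latency_s = (ttft_ms + (output_tokens - 1) * tpot_ms) / 1000.0
    return concurrency / latency_s


# (mean TTFT ms, TPOT ms, measured req/s) taken from the RTX 3090 table.
rtx3090_rows = {
    "bf16": (47.32, 10.64, 2.87),
    "W4A16": (40.69, 7.66, 3.97),
}

estimates = {}
for precision, (ttft, tpot, measured) in rtx3090_rows.items():
    estimates[precision] = estimate_req_per_s(ttft, tpot)
    print(f"{precision}: estimated {estimates[precision]:.2f} req/s, "
          f"measured {measured:.2f} req/s")
```

The estimates land within a few percent of the measured values, which is consistent with the run being bound by concurrency rather than by the request rate.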

Max throughput

| Precision | Output tok/s |
| --- | --- |
| bf16 | 1445.84 |
| W4A16 | 1710.57 |


RTX 4090

Performance

| Precision | Req/s | Output tok/s | Mean TTFT | p99 TTFT | TPOT / ITL | VRAM Used |
| --- | --- | --- | --- | --- | --- | --- |
| bf16 | 3.63 | 928.72 | 38.92 ms | 42.00 ms | 8.41 ms | 23910 MiB |
| W4A16 | 5.19 | 1328.33 | 33.48 ms | 37.29 ms | 5.84 ms | 23972 MiB |
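The Req/s and Output tok/s columns can be cross-checked against each other: with a fixed 256 output tokens per request, output throughput should be close to Req/s × 256. The sketch below runs that check on the RTX 4090 rows. It assumes every request generated exactly 256 tokens; the names are illustrative.

```python
OUTPUT_TOKENS_PER_REQUEST = 256  # from the test configuration

# (measured req/s, measured output tok/s) from the RTX 4090 table.
rtx4090_rows = {
    "bf16": (3.63, 928.72),
    "W4A16": (5.19, 1328.33),
}

relative_gaps = {}
for precision, (req_per_s, tok_per_s) in rtx4090_rows.items():
    expected_tok_per_s = req_per_s * OUTPUT_TOKENS_PER_REQUEST
    relative_gaps[precision] = abs(expected_tok_per_s - tok_per_s) / tok_per_s
    print(f"{precision}: expected {expected_tok_per_s:.2f} tok/s, "
          f"reported {tok_per_s:.2f} tok/s "
          f"(gap {relative_gaps[precision]:.2%})")
```

The gaps are small enough to be explained by Req/s being rounded to two decimals.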

Max throughput

| Precision | Output tok/s |
| --- | --- |
| bf16 | 1991.78 |
| W4A16 | 2802.97 |


RTX 5090

Performance

| Precision | Req/s | Output tok/s | Mean TTFT | p99 TTFT | TPOT / ITL | VRAM Used |
| --- | --- | --- | --- | --- | --- | --- |
| bf16 | 4.10 | 1048.97 | 44.64 ms | 54.29 ms | 7.40 ms | 30487 MiB |
| W4A16 | 4.95 | 1267.71 | 43.54 ms | 48.61 ms | 6.09 ms | 30601 MiB |

Max throughput

| Precision | Output tok/s |
| --- | --- |
| bf16 | 2312.41 |
| W4A16 | 2559.77 |
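To compare how much W4A16 quantization helps on each card, the max-throughput figures can be reduced to a speedup ratio over bf16. The sketch below does only that arithmetic on the numbers reported above; the dictionary layout and names are illustrative.

```python
# Max-throughput output tok/s per GPU, copied from the tables above.
max_throughput = {
    "RTX 3090": {"bf16": 1445.84, "W4A16": 1710.57},
    "RTX 4090": {"bf16": 1991.78, "W4A16": 2802.97},
    "RTX 5090": {"bf16": 2312.41, "W4A16": 2559.77},
}

# Speedup of W4A16 relative to bf16 on the same GPU.
speedups = {
    gpu: results["W4A16"] / results["bf16"]
    for gpu, results in max_throughput.items()
}

for gpu, speedup in speedups.items():
    print(f"{gpu}: W4A16 is {speedup:.2f}x bf16 at max throughput")
```

On these numbers the RTX 4090 gains the most from W4A16, while the RTX 3090 and RTX 5090 see smaller improvements.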
