# vLLM

## vLLM Benchmarks

All vLLM benchmarks on this page were run with **Trinity Nano**. Trinity Mini in bf16 (about 52.3 GB) does not fit in the VRAM of any single consumer GPU used in these tests (24 GB on the RTX 3090 and RTX 4090, 32 GB on the RTX 5090), so no Mini results are included.

### Test configuration

* Input tokens per request: 512
* Output tokens per request: 256
* Number of prompts: 512
* Max concurrency: 8
* Request rate: 8 requests/s
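
As a rough sanity check on how these settings relate to the reported metrics, the sketch below estimates request throughput from latency using Little's law. It assumes the run is bound by the concurrency limit rather than by the request rate, that every request generates the full 256 output tokens, and that TPOT is averaged over the tokens after the first. The function names are illustrative and not part of vLLM; the example values are the RTX 3090 bf16 row from the tables on this page.

```python
def request_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: first token, then the remaining tokens at TPOT each."""
    return ttft_ms + (output_tokens - 1) * tpot_ms


def saturated_req_per_s(concurrency: int, latency_ms: float) -> float:
    """Little's law: throughput = requests in flight / time each request spends in the system."""
    return concurrency / (latency_ms / 1000.0)


# Test configuration from this page.
OUTPUT_TOKENS = 256
CONCURRENCY = 8

# RTX 3090 bf16 row: mean TTFT 47.32 ms, TPOT 10.64 ms (assumed to be mean values).
latency = request_latency_ms(47.32, 10.64, OUTPUT_TOKENS)
req_s = saturated_req_per_s(CONCURRENCY, latency)
tok_s = req_s * OUTPUT_TOKENS

print(f"latency ~ {latency:.0f} ms, ~ {req_s:.2f} req/s, ~ {tok_s:.0f} output tok/s")
```

The estimate lands within a few percent of the reported 2.87 req/s and 735.67 output tok/s for that row, which suggests the 8 requests/s rate was not the limiting factor in that run.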

***

### RTX 3090

#### Performance

| Precision | Req/s | Output tok/s | Mean TTFT | p99 TTFT | TPOT / ITL | VRAM Used |
| --------- | ----- | ------------ | --------- | -------- | ---------- | --------- |
| bf16      | 2.87  | 735.67       | 47.32 ms  | 58.05 ms | 10.64 ms   | 23556 MiB |
| W4A16     | 3.97  | 1016.35      | 40.69 ms  | 51.35 ms | 7.66 ms    | 23594 MiB |

#### Max throughput

| Precision | Output tok/s |
| --------- | ------------ |
| bf16      | 1445.84      |
| W4A16     | 1710.57      |

***

### RTX 4090

#### Performance

| Precision | Req/s | Output tok/s | Mean TTFT | p99 TTFT | TPOT / ITL | VRAM Used |
| --------- | ----- | ------------ | --------- | -------- | ---------- | --------- |
| bf16      | 3.63  | 928.72       | 38.92 ms  | 42.00 ms | 8.41 ms    | 23910 MiB |
| W4A16     | 5.19  | 1328.33      | 33.48 ms  | 37.29 ms | 5.84 ms    | 23972 MiB |

#### Max throughput

| Precision | Output tok/s |
| --------- | ------------ |
| bf16      | 1991.78      |
| W4A16     | 2802.97      |

***

### RTX 5090

#### Performance

| Precision | Req/s | Output tok/s | Mean TTFT | p99 TTFT | TPOT / ITL | VRAM Used |
| --------- | ----- | ------------ | --------- | -------- | ---------- | --------- |
| bf16      | 4.10  | 1048.97      | 44.64 ms  | 54.29 ms | 7.40 ms    | 30487 MiB |
| W4A16     | 4.95  | 1267.71      | 43.54 ms  | 48.61 ms | 6.09 ms    | 30601 MiB |

#### Max throughput

| Precision | Output tok/s |
| --------- | ------------ |
| bf16      | 2312.41      |
| W4A16     | 2559.77      |
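
To compare the two precisions across cards, the ratios can be computed directly from the tables above. The sketch below is only arithmetic on the reported output tok/s figures; the dictionary layout and the `speedup` helper are illustrative, not part of any benchmark tooling.

```python
# Output tok/s copied from the tables on this page, as (bf16, W4A16).
PERFORMANCE = {
    "RTX 3090": (735.67, 1016.35),
    "RTX 4090": (928.72, 1328.33),
    "RTX 5090": (1048.97, 1267.71),
}
MAX_THROUGHPUT = {
    "RTX 3090": (1445.84, 1710.57),
    "RTX 4090": (1991.78, 2802.97),
    "RTX 5090": (2312.41, 2559.77),
}


def speedup(bf16_tok_s: float, w4a16_tok_s: float) -> float:
    """Ratio of W4A16 throughput to bf16 throughput (values above 1 mean W4A16 is faster)."""
    return w4a16_tok_s / bf16_tok_s


for gpu in PERFORMANCE:
    fixed = speedup(*PERFORMANCE[gpu])
    peak = speedup(*MAX_THROUGHPUT[gpu])
    print(f"{gpu}: W4A16 is {fixed:.2f}x bf16 at concurrency 8, {peak:.2f}x at max throughput")
```

In these runs W4A16 is faster than bf16 on every card, with the largest gain on the RTX 4090 (about 1.4x in both tables).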
