Consumer Hardware

Consumer GPU Performance

This section summarizes Trinity model performance on common single-GPU consumer hardware.

Benchmarks were run on:

GPU               VRAM
NVIDIA RTX 3090   24 GB
NVIDIA RTX 4090   24 GB
NVIDIA RTX 5090   ~32 GB

The results are organized by inference framework:

  • vLLM Benchmarks: performance of Trinity Nano using vLLM, including request throughput, token throughput, and latency metrics.

  • llama.cpp Benchmarks: performance of Trinity Nano and Trinity Mini using GGUF quantizations, covering decode speed, context scaling, and generation workloads.

All results shown are from single-GPU runs to reflect typical workstation and desktop deployments.

Benchmark Coverage

The benchmark dataset includes:

  • throughput and latency measurements

  • quantization sweeps

  • decode speed benchmarks

  • context scaling tests

  • real generation workloads (QA, code generation, long-form text)
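
As a rough illustration of how throughput and latency figures of this kind relate to each other, the sketch below derives request throughput, output-token throughput, and simple latency statistics from per-request timing records. It is a minimal sketch under stated assumptions, not the harness used for these benchmarks: the `RequestRecord` fields and the `summarize` helper are hypothetical names introduced here, and the numbers in the usage note are made up for the example.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class RequestRecord:
    """Hypothetical per-request timing record (not a vLLM or llama.cpp type)."""
    start_s: float       # time the request was sent, in seconds
    end_s: float         # time the last token was received, in seconds
    output_tokens: int   # number of tokens generated for this request


def summarize(records):
    """Aggregate per-request records into throughput and latency figures."""
    if not records:
        raise ValueError("at least one request record is required")

    # Wall-clock span of the whole run: first request sent to last one finished.
    wall_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    if wall_s <= 0:
        raise ValueError("run duration must be positive")

    latencies = [r.end_s - r.start_s for r in records]
    total_tokens = sum(r.output_tokens for r in records)

    return {
        "requests_per_s": len(records) / wall_s,
        "output_tokens_per_s": total_tokens / wall_s,
        "mean_latency_s": mean(latencies),
        "median_latency_s": median(latencies),
        "max_latency_s": max(latencies),
    }
```

For example, two overlapping requests that together span 4 seconds of wall-clock time and produce 400 tokens give 0.5 requests per second and 100 output tokens per second. Throughput is computed over the run's wall-clock span rather than by summing per-request durations, so overlapping (batched) requests are not double counted.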
