Trinity-Large-Preview

Overview

Trinity Large (Preview) is a sparse mixture-of-experts language model with 400B total parameters, of which 13B are active per token. It is engineered to scale model capacity while keeping inference efficient over long contexts, and it performs strongly on reasoning-heavy workloads, including math, coding tasks, and multi-step agent workflows.

Key Features

  • Sparse mixture-of-experts architecture: Uses an extremely sparse MoE design with 400B total parameters and 13B activated per token. Sparse expert routing constrains per-token activation, enabling efficient inference at scale.

  • Long-context training and utilization: Trained at 256K sequence length with support for 512K inference (hosted at 128K), using architecture and training procedures designed to operate effectively over long inputs and extended multi-turn interactions.

  • High throughput efficiency: Designed with inference-time efficiency as a primary objective, leveraging both extreme sparsity and optimized attention mechanisms to achieve strong throughput on modern accelerator hardware.
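The sparse routing idea in the first bullet can be illustrated with a small sketch. This is a generic top-k softmax gate, not the model's actual router: the gating function, normalization, and any load-balancing scheme are not described here, so `route_token` and the random stand-in logits are hypothetical. Only the expert counts (256 total, 4 active) come from the Model Summary.

```python
import math
import random

NUM_EXPERTS = 256  # from the Model Summary
TOP_K = 4          # experts active per token, from the Model Summary


def route_token(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Hypothetical gate: weights are a softmax over the selected logits only,
    so they sum to 1. The real router may differ.
    """
    # Indices of the k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
# Stand-in for the router's output for a single token (assumption).
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print(selected)
print(f"experts used for this token: {len(selected)} of {NUM_EXPERTS}")
```

Only the selected experts run for a given token, which is why per-token compute tracks the active parameter count rather than the total.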

Deployment Quickstart

To deploy Trinity Large, download the model and proceed to Quick Deploys.

Model Summary

  • Name: Trinity-Large-Preview

  • Architecture: Mixture-of-Experts

  • Parameters: 400 billion total, 13 billion active

  • Experts: 256 experts, 4 active

  • Attention mechanism: Grouped Query Attention (GQA)

  • Training tokens: 17 trillion

  • License: Apache 2.0
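As a quick check on how sparse the model is, the figures in the summary can be turned into ratios. The arithmetic below uses only the numbers listed above; the variable names are illustrative.

```python
TOTAL_PARAMS = 400e9   # 400 billion total parameters
ACTIVE_PARAMS = 13e9   # 13 billion active per token
NUM_EXPERTS = 256
ACTIVE_EXPERTS = 4

param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS    # share of weights used per token
expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS   # share of experts used per token

print(f"active parameters per token: {param_fraction:.2%}")   # 3.25%
print(f"active experts per token:    {expert_fraction:.4%}")  # 1.5625%
```

The parameter fraction is higher than the expert fraction, presumably because non-expert components such as attention and embeddings run for every token regardless of routing.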

Recommended Inference Parameters

  • temperature: 0.8

  • top_p: 0.8
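To show what these two settings do, here is a minimal sketch of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It is a generic illustration, not the sampler used by any particular serving stack; `sample_token` and `toy_logits` are hypothetical names, and only the values 0.8 and 0.8 come from the recommendations above.

```python
import math
import random

TEMPERATURE = 0.8  # recommended value
TOP_P = 0.8        # recommended value


def sample_token(logits, temperature=TEMPERATURE, top_p=TOP_P, rng=random):
    """Sample an index using temperature scaling, then top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalises the weights of the kept tokens.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # illustrative values only
rng = random.Random(0)
samples = [sample_token(toy_logits, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With these toy logits, the two most probable tokens already cover more than 80% of the probability mass after temperature scaling, so the remaining three are never sampled. Lower values of either setting narrow the output further; higher values admit more of the tail.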
