Under The Hood : Trinity-Large-Thinking Disected
The benchmark numbers are solid: 91.9% on PinchBench, 63.2% on SWE-bench Verified, 88.0% on Tau2-Bench (Airline). But benchmarks are not what makes this release worth studying. What matters is the set of architectural choices Arcee made, and what those choices mean when you read the config files line by line.
Let's look under the hood: the architecture decisions, every significant config parameter, the training pipeline, and how it compares to the other major open-weight MoE models shipping today.
What the architecture does differently
Open the config.json and the first line reveals what this model is:
"architectures": ["AfmoeForCausalLM"]
This is not a Llama derivative or a Qwen fork. Arcee designed a custom architecture designated AfmoeForCausalLM (Arcee Frontier Mixture of Experts). It shares design patterns with DeepSeek-V3 (sigmoid routing, shared experts), but the combination of hybrid attention, extreme sparsity, and muP-enabled scaling is specific to Trinity. It ships with its own configuration_afmoe.py and modeling_afmoe.py via HuggingFace's auto_map, so the model definition lives alongside the weights. Every line of the forward pass is inspectable.
The overall architecture is shown in the block diagram below. The left column shows the transformer block: RMSNorm, attention (with GQA and gated output), residual connection, then RMSNorm into the MoE layer. The expanded views on the right show the Top-K Router selecting 4 of 256 routed experts plus 1 shared expert, and the Grouped-Query Attention with QK-normalization and conditional RoPE (applied only in local layers).
Source: Figure 2, "Arcee Trinity Large Technical Report," Singh et al., arXiv:2602.17004
Five design choices separate Trinity from the current open MoE field:
- Extreme sparsity. 256 routed experts, 4 active per token, 1 shared expert always on. A 1.56% routing fraction, among the most aggressive in production MoE models with multi-expert routing.
- Hybrid attention. A repeating 3:1 pattern of sliding-window and full-attention layers across 60 layers, with different positional encoding strategies per layer type.
- Sigmoid routing with SMEBU. Independent expert scoring with a novel load-balancing mechanism that actually converges, replacing the oscillating auxiliary-loss approach.
- muP-enabled scaling. Hyperparameters tuned on small proxy models (6B, 26B) transferred directly to 398B. Zero loss spikes across 17 trillion training tokens.
- 6 dense foundation layers. Twice the typical count, added specifically to stabilize routing at extreme sparsity.
Reading the config files
The real story of any model is in its config. Here is what the designers chose and why.
MoE: the expert configuration
"num_experts": 256,
"num_experts_per_tok": 4,
"num_shared_experts": 1,
"intermediate_size": 3072
Each routed expert has a SwiGLU FFN with intermediate dimension 3,072. When 4 fire simultaneously: 4 x 3,072 = 12,288 effective intermediate dim per token. The shared expert adds another 3,072, giving 15,360 effective width per forward pass. This keeps per-token compute roughly consistent with what a smaller dense model would use, while the full 256-expert pool provides broad knowledge capacity.
num_shared_experts: 1 means one expert processes every token as a residual baseline. If the router makes a bad decision, this expert catches it. DeepSeek-V3 and Kimi K2 use the same pattern.
How the expert configuration compares:
| Model | Routed Experts | Active per Token | Routing Fraction | Active Params | Total Params |
|---|---|---|---|---|---|
| Trinity Large | 256 | 4 | 1.56% | ~13B | ~398B |
| Kimi K2 | 384 | 8 | 2.08% | ~32B | ~1T |
| DeepSeek-V3 | 256 | 8 | 3.13% | ~37B | ~671B |
| Qwen3-235B | 128 | 8 | 6.25% | ~22B | ~235B |
| Llama 4 Maverick | 128 | 1 | 0.78% | ~17B | ~400B |
Llama 4 Maverick has a lower routing fraction on paper, but routes only 1 expert per token. Trinity's 4-of-256 gives each token four specialized expert networks while keeping total active parameters at ~13B. Kimi K2 pushes further with 384 experts and 8 active, but at 1T total parameters and ~32B active, it requires substantially more hardware to serve.
The routing strategy
"score_func": "sigmoid",
"route_norm": true,
"route_scale": 2.448,
"load_balance_coeff": 0.00005
Sigmoid scoring instead of softmax. Each expert's relevance score is computed independently between 0 and 1. No cross-expert competition in the scoring function. Competition is enforced only through top-k selection. This is the same approach Kimi K2 and DeepSeek-V3 use. Qwen3-235B and Llama 4 Maverick still use softmax.
route_scale: 2.448 amplifies router logits before the sigmoid, sharpening expert selection. For comparison, Kimi K2's routed_scaling_factor is 2.827 (close to the square root of 8, its active expert count). Trinity's 2.448 likely serves a similar variance-preservation role for 4 active experts.
load_balance_coeff: 0.00005 is a telling parameter. This extremely small auxiliary balance loss coefficient signals that Trinity relies primarily on SMEBU (Soft-clamped Momentum Expert Bias Updates) for load balancing rather than the traditional auxiliary loss approach.
Why this matters: standard auxiliary loss for MoE load balancing oscillates and never converges. A sign-based bias update overshoots, corrects, overshoots again. SMEBU replaces this with a tanh-clamped, momentum-smoothed bias update that reaches a stable equilibrium. The load_balance_coeff is kept near zero as a gentle supplementary nudge, not as the primary mechanism. The Arcee technical report describes SMEBU as one of the key factors behind their zero-loss-spike training run.
Hybrid attention: the 3:1 sliding/full pattern
"layer_types": ["sliding_attention", "sliding_attention", "sliding_attention", "full_attention", ...],
"sliding_window": 4096,
"global_attn_every_n_layers": 4,
"num_hidden_layers": 60
One of the more distinctive config choices. Of Trinity's 60 transformer layers, 45 use sliding-window attention with a 4,096-token window, and 15 use full global attention. Three local layers, then one global layer, repeating through the stack.
Sliding-window attention is dramatically cheaper to compute. Instead of every token attending to every other token (quadratic scaling), local layers only look at a 4,096-token neighborhood. The global layers, appearing every fourth layer, handle long-range dependencies and cross-document reasoning.
A critical subtlety: the global layers use no positional embeddings (NoPE), while the local layers use RoPE with "rope_theta": 10000. This separation lets the model learn different things at each scale. Local layers handle syntactic and short-range patterns with positional awareness. Global layers handle broader reasoning without position bias. The Arcee technical report notes this arrangement was "critical for effective long-context performance."
Neither DeepSeek-V3 nor Kimi K2 use a sliding/full attention hybrid. DeepSeek-V3 and Kimi K2 use Multi-head Latent Attention (MLA) with learned KV compression. Qwen3-235B uses standard GQA without sliding windows. Llama 4 Maverick uses interleaved attention but with a different pattern.
The practical result: Trinity was trained at 256K context ("max_position_embeddings": 262144) and scores 0.976 on MK-NIAH at 512K despite never training at that length. Performance degrades substantially beyond that (0.42 at 1M tokens). During context extension, only the global attention layers needed adjustment; the local layers stayed frozen.
Grouped Query Attention
"num_attention_heads": 48,
"num_key_value_heads": 8,
"head_dim": 128
48 query heads sharing 8 key-value heads gives a 6:1 GQA ratio. This cuts KV-cache memory by 6x versus full multi-head attention.
How it compares:
| Model | Attention Type | Query Heads | KV Heads | Compression Ratio |
|---|---|---|---|---|
| Trinity Large | GQA | 48 | 8 | 6:1 |
| Kimi K2 | MLA | 64 | 64 (latent compressed) | ~28x (via 512-dim latent) |
| DeepSeek-V3 | MLA | 128 | 128 (latent compressed) | ~28x (via 512-dim latent) |
| Qwen3-235B | GQA | 64 | 4 | 16:1 |
Qwen3-235B compresses more aggressively at 16:1 but at some cost to representational fidelity. Kimi K2 and DeepSeek-V3 use MLA, which compresses the entire KV representation into a 512-dimensional latent, achieving roughly 28x reduction through a fundamentally different mechanism. Trinity's 6:1 GQA is simpler to implement and provides a practical middle ground: enough compression to serve heavy concurrent loads (each user session requires its own KV-cache), enough key-value heads to preserve nuanced attention patterns.
Dense foundation layers
"num_dense_layers": 6
The first 6 of 60 layers are plain dense transformer layers with no MoE routing. A notable increase over peers: DeepSeek-V3 uses 3 dense layers ("first_k_dense_replace": 3). Kimi K2 uses 1. Most MoE models use 1 to 3.
Arcee increased this from their original plan of 3 to 6 specifically to stabilize routing at their extreme sparsity level. The early dense layers build stable token representations before the routing network has to decide which experts to activate. The Arcee technical report is candid here: they describe initial runs where routing behavior drifted, experts collapsed, and loss plateaued. Doubling the dense layers was part of a set of simultaneous fixes (alongside SMEBU, z-loss, QK-norm, depth-scaled sandwich norm, and RSDB) that stabilized training.
Normalization and training stability
"hidden_act": "silu",
"rms_norm_eps": 1e-05
SiLU activation in SwiGLU feed-forward networks. Standard choice across modern transformers (DeepSeek-V3, Kimi K2, Qwen3 all use the same).
RMSNorm with standard epsilon. But Trinity goes further with depth-scaled sandwich normalization: both input and output of each sublayer are normalized, and the output norm's gain is initialized to 1/sqrt(L) (approximately 0.129 for 60 layers). This progressively dampens residual contributions from deeper layers, preventing gradient explosions during training.
Additional stability mechanisms from the config and technical report:
- z-loss regularization on the routing logits, preventing expert scores from drifting to extreme values.
- QK-normalization in the attention layers.
- RSDB (Random Sequential Document Buffer), a novel data-loading technique developed mid-training to eliminate document boundary artifacts.
These mechanisms contributed to overall training stability alongside the SMEBU load balancing and dense foundation layers described above.
Vocabulary and embeddings
"vocab_size": 200192,
"tie_word_embeddings": false
A 200K-token BPE vocabulary, one of the largest in the open MoE space:
| Model | Vocab Size |
|---|---|
| Trinity Large | 200,192 |
| Llama 4 Maverick | 202,048 |
| Kimi K2 | 163,840 |
| Qwen3-235B | 151,936 |
| DeepSeek-V3 | 129,280 |
Custom-trained with place-aligned digit chunking for better arithmetic performance. Extended script-aware isolation covering Thai, Lao, Khmer, Myanmar, Korean, and CJK scripts on top of standard multilingual coverage.
Separate input and output embeddings (tie_word_embeddings: false). This increases parameter count slightly but gives the model more flexibility in mapping between input token representations and output logit predictions. Kimi K2 makes the same choice.
muP: scaling without guessing
"mup_enabled": true
Maximal Update Parametrization (muP) is enabled. Most open-source MoE models skip it. muP allows hyperparameters (learning rates, initialization scales) to transfer across model sizes. You tune once on a small proxy, and the values hold at full scale.
For Arcee, this was practically necessary. Their entire four-model program (Nano 6B, Mini 26B, Medium 82B, Large 398B) cost $20 million all-in. No budget for extensive hyperparameter sweeps at 398B. muP let them validate on Nano and Mini first, then scale up. The stable training run across 17T tokens suggests the transfer worked.
Other config details
"hidden_size": 3072for the model hidden dimension. Smaller than DeepSeek-V3's 7,168 or Kimi K2's 7,168. The knowledge capacity comes from expert breadth (256 experts), not hidden width."use_grouped_mm": trueenables grouped matrix multiplication for batching expert computations. Key for efficient MoE inference on GPUs.
Training: 17 trillion tokens on 2,048 B300s
Trinity-Large was pretrained on 17 trillion tokens from a 20-trillion-token dataset curated by DatologyAI. Training finished in 33 days on 2,048 NVIDIA B300 GPUs. According to Arcee, this is the largest publicly stated pretraining run on B300 hardware.
The training loss curve below shows the full run with no sub-sampling or smoothing. The vertical dashed lines mark the batch size increase (to 128M tokens per batch at 5T), the start of phase 2 (10T), and phase 3 (~14T). The curve is clean throughout, with no loss spikes.
Source: Figure 1, "Arcee Trinity Large Technical Report," Singh et al., arXiv:2602.17004
Data mix. Over 8 trillion of the 17T tokens are synthetic, generated using approaches building on BeyondWeb. The synthetic data covers web content (6.5T tokens, rephrased from high-quality seed documents), multilingual data (1T tokens across 14 languages), and code (800B tokens). The remainder is curated web-scale data covering programming, STEM, reasoning, and multilingual text.
Three-phase training. The 17T tokens were distributed across three phases with progressively shifting data mixes. Each subsequent phase increased the proportion of code, math, and science content. This progressive specialization approach avoids catastrophic forgetting of general capabilities.
Muon optimizer. Trinity used Muon for the hidden layers and AdamW only for the embedding and output layers. Kimi K2 also chose Muon (with their MuonClip variant). Muon enables a larger critical batch size and higher sample efficiency than AdamW alone. The batch size increased from 12,288 to 16,384 sequences after 4.9 trillion tokens.
Context extension. As noted in the hybrid attention section, training at 256K generalized to 512K without explicit long-context training. The asymmetric extension (only global layers adjusted, local layers frozen) is a direct consequence of the 3:1 sliding/full pattern.
Post-training: from base to reasoning model
The "Thinking" variant adds three post-training stages on top of the pretrained base:
Supervised fine-tuning on a blend of public instruction data, human-written prompts, synthetic instructions, and agentic coding trajectories collected through the OpenCode harness. The model learns full edit-run-test loops, not isolated code completions.
Reinforcement learning using prime-rl on 2,048 B300 GPUs with asynchronous vLLM-backed rollout workers. Verifiable rewards where possible (strict answer-format validation), learned reward model as fallback.
Extended chain-of-thought via
<think>...</think>blocks. The model generates explicit reasoning traces before responding. These traces must be preserved in conversation history for multi-turn agent loops to function.
The <think> blocks are not a prompting trick. They are a trained behavior from RL, where the model learned that explicit reasoning steps before action improve task completion rates in agentic scenarios.
How it compares architecturally
| Dimension | Trinity Large | Kimi K2/K2.5 | DeepSeek-V3 | Qwen3-235B | Llama 4 Maverick |
|---|---|---|---|---|---|
| Type | MoE | MoE | MoE | MoE | MoE |
| Total params | ~398B | ~1T | ~671B | ~235B | ~400B |
| Active params/token | ~13B | ~32B | ~37B | ~22B | ~17B |
| Experts (routed) | 256 | 384 | 256 | 128 | 128 |
| Active per token | 4 | 8 | 8 | 8 | 1 |
| Routing fraction | 1.56% | 2.08% | 3.13% | 6.25% | 0.78% |
| Attention | Hybrid GQA (sliding+full) | MLA | MLA | GQA | GQA |
| Hidden dim | 3,072 | 7,168 | 7,168 | 4,096 | 5,120 |
| Layers | 60 | 61 | 61 | 94 | 48 |
| Dense layers | 6 | 1 | 3 | 3 | Unknown |
| Context (native) | 256K (extends to 512K) | 256K (K2.5) | 128K | 32K (base) | 1M |
| Vocab size | 200,192 | 163,840 | 129,280 | 151,936 | 202,048 |
| Router scoring | Sigmoid | Sigmoid | Sigmoid | Softmax | Softmax |
| Optimizer | Muon + AdamW | MuonClip | AdamW | AdamW | AdamW |
| KV compression | GQA 6:1 | MLA (~28x) | MLA (~28x) | GQA 16:1 | GQA |
| muP | Yes | No | No | No | No |
| License | Apache 2.0 | Modified MIT | Custom | Apache 2.0 | Llama 4 |
Reading the table, the key architectural tradeoff becomes clear: Trinity is the most compute-efficient per token in its weight class. At ~13B active parameters, it processes each token with roughly a third of DeepSeek-V3's compute (~37B active) and about 60% of Qwen3's (~22B active). The cost is a smaller hidden dimension (3,072 vs 7,168), which means less representational capacity per expert. Trinity compensates with expert breadth over expert depth. It is also the only model in this comparison with muP and the only one using a hybrid sliding/full attention pattern.
Deployment reality
Serving Trinity-Large-Thinking requires multi-GPU infrastructure. The full FP16 model needs substantial memory, though the ~13B active parameters per token keep inference throughput high relative to total model size.
The inference benchmarks below compare Trinity-Large against DeepSeek-V3 and GLM 4.7, all running in FP8 on 8xH200 GPUs with vLLM 0.14.0. Trinity leads on output throughput and total throughput across all input/output size configurations, and shows the lowest time-per-output-token and time-to-first-token at most sizes. The sparsity (13B active vs 37B for DeepSeek-V3) translates directly into faster per-token generation.
Source: Figure 4, "Arcee Trinity Large Technical Report," Singh et al., arXiv:2602.17004
The model is supported on vLLM 0.11.1+ with the following flags:
--reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser qwen3_coder
Available deployment paths:
| Channel | Details |
|---|---|
| Weights on HuggingFace | arcee-ai/Trinity-Large-Thinking, Apache 2.0 |
| Arcee API | $0.90/M output tokens |
| OpenRouter | Full reasoning and tool-calling support |
| Self-hosted (vLLM) | Requires multi-GPU setup, vLLM 0.11.1+ |
| Dell Enterprise Hub | Optimized containers for Dell PowerEdge platforms |
The Dell Enterprise Hub ships inference containers with vLLM configurations for Dell PowerEdge hardware, covering balanced, high-concurrency, and long-context scenarios.
Enterprise use cases at a glance
The architecture choices translate to three deployment patterns worth noting briefly:
Agentic customer service. Multi-turn tool-calling with auditable
<think>traces. The 6:1 GQA ratio determines how many concurrent agent sessions fit per GPU node. Trinity scored 94.7% on Tau2-Bench (Telecom) and 88.0% on Tau2-Bench (Airline).Software engineering agents. Post-trained on full edit-run-test loops via OpenCode. The 512K context window holds entire microservice codebases without chunking. Scored 63.2% on SWE-bench Verified.
Long-document reasoning. The hybrid attention pattern makes 512K-token contexts economically feasible without quadratic cost across all layers. Self-hosting keeps sensitive documents on-premise.
In each case, the Apache 2.0 license enables fine-tuning on internal data, and the <think> blocks provide audit trails for regulated environments.


