Under The Hood : Trinity-Large-Thinking Disected

Community Article Published May 1, 2026

Arcee AI's Trinity-Large-Thinking is a 398-billion parameter sparse Mixture-of-Experts model that activates ~13 billion parameters per token, uses a hybrid sliding-window and full-attention design across 60 layers, and ships under Apache 2.0. It was released on April 1, 2026.

The benchmark numbers are solid: 91.9% on PinchBench, 63.2% on SWE-bench Verified, 88.0% on Tau2-Bench (Airline). But benchmarks are not what makes this release worth studying. What matters is the set of architectural choices Arcee made, and what those choices mean when you read the config files line by line.

Let's look under the hood: the architecture decisions, every significant config parameter, the training pipeline, and how it compares to the other major open-weight MoE models shipping today.

What the architecture does differently

Open the config.json and the first line reveals what this model is:

"architectures": ["AfmoeForCausalLM"]

This is not a Llama derivative or a Qwen fork. Arcee designed a custom architecture designated AfmoeForCausalLM (Arcee Frontier Mixture of Experts). It shares design patterns with DeepSeek-V3 (sigmoid routing, shared experts), but the combination of hybrid attention, extreme sparsity, and muP-enabled scaling is specific to Trinity. It ships with its own configuration_afmoe.py and modeling_afmoe.py via HuggingFace's auto_map, so the model definition lives alongside the weights. Every line of the forward pass is inspectable.

The overall architecture is shown in the block diagram below. The left column shows the transformer block: RMSNorm, attention (with GQA and gated output), residual connection, then RMSNorm into the MoE layer. The expanded views on the right show the Top-K Router selecting 4 of 256 routed experts plus 1 shared expert, and the Grouped-Query Attention with QK-normalization and conditional RoPE (applied only in local layers).

image

Source: Figure 2, "Arcee Trinity Large Technical Report," Singh et al., arXiv:2602.17004

Five design choices separate Trinity from the current open MoE field:

  • Extreme sparsity. 256 routed experts, 4 active per token, 1 shared expert always on. A 1.56% routing fraction, among the most aggressive in production MoE models with multi-expert routing.
  • Hybrid attention. A repeating 3:1 pattern of sliding-window and full-attention layers across 60 layers, with different positional encoding strategies per layer type.
  • Sigmoid routing with SMEBU. Independent expert scoring with a novel load-balancing mechanism that actually converges, replacing the oscillating auxiliary-loss approach.
  • muP-enabled scaling. Hyperparameters tuned on small proxy models (6B, 26B) transferred directly to 398B. Zero loss spikes across 17 trillion training tokens.
  • 6 dense foundation layers. Twice the typical count, added specifically to stabilize routing at extreme sparsity.

Reading the config files

The real story of any model is in its config. Here is what the designers chose and why.

MoE: the expert configuration

"num_experts": 256,
"num_experts_per_tok": 4,
"num_shared_experts": 1,
"intermediate_size": 3072

Each routed expert has a SwiGLU FFN with intermediate dimension 3,072. When 4 fire simultaneously: 4 x 3,072 = 12,288 effective intermediate dim per token. The shared expert adds another 3,072, giving 15,360 effective width per forward pass. This keeps per-token compute roughly consistent with what a smaller dense model would use, while the full 256-expert pool provides broad knowledge capacity.

num_shared_experts: 1 means one expert processes every token as a residual baseline. If the router makes a bad decision, this expert catches it. DeepSeek-V3 and Kimi K2 use the same pattern.

How the expert configuration compares:

Model Routed Experts Active per Token Routing Fraction Active Params Total Params
Trinity Large 256 4 1.56% ~13B ~398B
Kimi K2 384 8 2.08% ~32B ~1T
DeepSeek-V3 256 8 3.13% ~37B ~671B
Qwen3-235B 128 8 6.25% ~22B ~235B
Llama 4 Maverick 128 1 0.78% ~17B ~400B

Llama 4 Maverick has a lower routing fraction on paper, but routes only 1 expert per token. Trinity's 4-of-256 gives each token four specialized expert networks while keeping total active parameters at ~13B. Kimi K2 pushes further with 384 experts and 8 active, but at 1T total parameters and ~32B active, it requires substantially more hardware to serve.

The routing strategy

"score_func": "sigmoid",
"route_norm": true,
"route_scale": 2.448,
"load_balance_coeff": 0.00005

Sigmoid scoring instead of softmax. Each expert's relevance score is computed independently between 0 and 1. No cross-expert competition in the scoring function. Competition is enforced only through top-k selection. This is the same approach Kimi K2 and DeepSeek-V3 use. Qwen3-235B and Llama 4 Maverick still use softmax.

route_scale: 2.448 amplifies router logits before the sigmoid, sharpening expert selection. For comparison, Kimi K2's routed_scaling_factor is 2.827 (close to the square root of 8, its active expert count). Trinity's 2.448 likely serves a similar variance-preservation role for 4 active experts.

load_balance_coeff: 0.00005 is a telling parameter. This extremely small auxiliary balance loss coefficient signals that Trinity relies primarily on SMEBU (Soft-clamped Momentum Expert Bias Updates) for load balancing rather than the traditional auxiliary loss approach.

Why this matters: standard auxiliary loss for MoE load balancing oscillates and never converges. A sign-based bias update overshoots, corrects, overshoots again. SMEBU replaces this with a tanh-clamped, momentum-smoothed bias update that reaches a stable equilibrium. The load_balance_coeff is kept near zero as a gentle supplementary nudge, not as the primary mechanism. The Arcee technical report describes SMEBU as one of the key factors behind their zero-loss-spike training run.

Hybrid attention: the 3:1 sliding/full pattern

"layer_types": ["sliding_attention", "sliding_attention", "sliding_attention", "full_attention", ...],
"sliding_window": 4096,
"global_attn_every_n_layers": 4,
"num_hidden_layers": 60

One of the more distinctive config choices. Of Trinity's 60 transformer layers, 45 use sliding-window attention with a 4,096-token window, and 15 use full global attention. Three local layers, then one global layer, repeating through the stack.

Sliding-window attention is dramatically cheaper to compute. Instead of every token attending to every other token (quadratic scaling), local layers only look at a 4,096-token neighborhood. The global layers, appearing every fourth layer, handle long-range dependencies and cross-document reasoning.

A critical subtlety: the global layers use no positional embeddings (NoPE), while the local layers use RoPE with "rope_theta": 10000. This separation lets the model learn different things at each scale. Local layers handle syntactic and short-range patterns with positional awareness. Global layers handle broader reasoning without position bias. The Arcee technical report notes this arrangement was "critical for effective long-context performance."

Neither DeepSeek-V3 nor Kimi K2 use a sliding/full attention hybrid. DeepSeek-V3 and Kimi K2 use Multi-head Latent Attention (MLA) with learned KV compression. Qwen3-235B uses standard GQA without sliding windows. Llama 4 Maverick uses interleaved attention but with a different pattern.

The practical result: Trinity was trained at 256K context ("max_position_embeddings": 262144) and scores 0.976 on MK-NIAH at 512K despite never training at that length. Performance degrades substantially beyond that (0.42 at 1M tokens). During context extension, only the global attention layers needed adjustment; the local layers stayed frozen.

Grouped Query Attention

"num_attention_heads": 48,
"num_key_value_heads": 8,
"head_dim": 128

48 query heads sharing 8 key-value heads gives a 6:1 GQA ratio. This cuts KV-cache memory by 6x versus full multi-head attention.

How it compares:

Model Attention Type Query Heads KV Heads Compression Ratio
Trinity Large GQA 48 8 6:1
Kimi K2 MLA 64 64 (latent compressed) ~28x (via 512-dim latent)
DeepSeek-V3 MLA 128 128 (latent compressed) ~28x (via 512-dim latent)
Qwen3-235B GQA 64 4 16:1

Qwen3-235B compresses more aggressively at 16:1 but at some cost to representational fidelity. Kimi K2 and DeepSeek-V3 use MLA, which compresses the entire KV representation into a 512-dimensional latent, achieving roughly 28x reduction through a fundamentally different mechanism. Trinity's 6:1 GQA is simpler to implement and provides a practical middle ground: enough compression to serve heavy concurrent loads (each user session requires its own KV-cache), enough key-value heads to preserve nuanced attention patterns.

Dense foundation layers

"num_dense_layers": 6

The first 6 of 60 layers are plain dense transformer layers with no MoE routing. A notable increase over peers: DeepSeek-V3 uses 3 dense layers ("first_k_dense_replace": 3). Kimi K2 uses 1. Most MoE models use 1 to 3.

Arcee increased this from their original plan of 3 to 6 specifically to stabilize routing at their extreme sparsity level. The early dense layers build stable token representations before the routing network has to decide which experts to activate. The Arcee technical report is candid here: they describe initial runs where routing behavior drifted, experts collapsed, and loss plateaued. Doubling the dense layers was part of a set of simultaneous fixes (alongside SMEBU, z-loss, QK-norm, depth-scaled sandwich norm, and RSDB) that stabilized training.

Normalization and training stability

"hidden_act": "silu",
"rms_norm_eps": 1e-05

SiLU activation in SwiGLU feed-forward networks. Standard choice across modern transformers (DeepSeek-V3, Kimi K2, Qwen3 all use the same).

RMSNorm with standard epsilon. But Trinity goes further with depth-scaled sandwich normalization: both input and output of each sublayer are normalized, and the output norm's gain is initialized to 1/sqrt(L) (approximately 0.129 for 60 layers). This progressively dampens residual contributions from deeper layers, preventing gradient explosions during training.

Additional stability mechanisms from the config and technical report:

  • z-loss regularization on the routing logits, preventing expert scores from drifting to extreme values.
  • QK-normalization in the attention layers.
  • RSDB (Random Sequential Document Buffer), a novel data-loading technique developed mid-training to eliminate document boundary artifacts.

These mechanisms contributed to overall training stability alongside the SMEBU load balancing and dense foundation layers described above.

Vocabulary and embeddings

"vocab_size": 200192,
"tie_word_embeddings": false

A 200K-token BPE vocabulary, one of the largest in the open MoE space:

Model Vocab Size
Trinity Large 200,192
Llama 4 Maverick 202,048
Kimi K2 163,840
Qwen3-235B 151,936
DeepSeek-V3 129,280

Custom-trained with place-aligned digit chunking for better arithmetic performance. Extended script-aware isolation covering Thai, Lao, Khmer, Myanmar, Korean, and CJK scripts on top of standard multilingual coverage.

Separate input and output embeddings (tie_word_embeddings: false). This increases parameter count slightly but gives the model more flexibility in mapping between input token representations and output logit predictions. Kimi K2 makes the same choice.

muP: scaling without guessing

"mup_enabled": true

Maximal Update Parametrization (muP) is enabled. Most open-source MoE models skip it. muP allows hyperparameters (learning rates, initialization scales) to transfer across model sizes. You tune once on a small proxy, and the values hold at full scale.

For Arcee, this was practically necessary. Their entire four-model program (Nano 6B, Mini 26B, Medium 82B, Large 398B) cost $20 million all-in. No budget for extensive hyperparameter sweeps at 398B. muP let them validate on Nano and Mini first, then scale up. The stable training run across 17T tokens suggests the transfer worked.

Other config details

  • "hidden_size": 3072 for the model hidden dimension. Smaller than DeepSeek-V3's 7,168 or Kimi K2's 7,168. The knowledge capacity comes from expert breadth (256 experts), not hidden width.
  • "use_grouped_mm": true enables grouped matrix multiplication for batching expert computations. Key for efficient MoE inference on GPUs.

Training: 17 trillion tokens on 2,048 B300s

Trinity-Large was pretrained on 17 trillion tokens from a 20-trillion-token dataset curated by DatologyAI. Training finished in 33 days on 2,048 NVIDIA B300 GPUs. According to Arcee, this is the largest publicly stated pretraining run on B300 hardware.

The training loss curve below shows the full run with no sub-sampling or smoothing. The vertical dashed lines mark the batch size increase (to 128M tokens per batch at 5T), the start of phase 2 (10T), and phase 3 (~14T). The curve is clean throughout, with no loss spikes.

image

Source: Figure 1, "Arcee Trinity Large Technical Report," Singh et al., arXiv:2602.17004

Data mix. Over 8 trillion of the 17T tokens are synthetic, generated using approaches building on BeyondWeb. The synthetic data covers web content (6.5T tokens, rephrased from high-quality seed documents), multilingual data (1T tokens across 14 languages), and code (800B tokens). The remainder is curated web-scale data covering programming, STEM, reasoning, and multilingual text.

Three-phase training. The 17T tokens were distributed across three phases with progressively shifting data mixes. Each subsequent phase increased the proportion of code, math, and science content. This progressive specialization approach avoids catastrophic forgetting of general capabilities.

Muon optimizer. Trinity used Muon for the hidden layers and AdamW only for the embedding and output layers. Kimi K2 also chose Muon (with their MuonClip variant). Muon enables a larger critical batch size and higher sample efficiency than AdamW alone. The batch size increased from 12,288 to 16,384 sequences after 4.9 trillion tokens.

Context extension. As noted in the hybrid attention section, training at 256K generalized to 512K without explicit long-context training. The asymmetric extension (only global layers adjusted, local layers frozen) is a direct consequence of the 3:1 sliding/full pattern.

Post-training: from base to reasoning model

The "Thinking" variant adds three post-training stages on top of the pretrained base:

  1. Supervised fine-tuning on a blend of public instruction data, human-written prompts, synthetic instructions, and agentic coding trajectories collected through the OpenCode harness. The model learns full edit-run-test loops, not isolated code completions.

  2. Reinforcement learning using prime-rl on 2,048 B300 GPUs with asynchronous vLLM-backed rollout workers. Verifiable rewards where possible (strict answer-format validation), learned reward model as fallback.

  3. Extended chain-of-thought via <think>...</think> blocks. The model generates explicit reasoning traces before responding. These traces must be preserved in conversation history for multi-turn agent loops to function.

The <think> blocks are not a prompting trick. They are a trained behavior from RL, where the model learned that explicit reasoning steps before action improve task completion rates in agentic scenarios.

How it compares architecturally

Dimension Trinity Large Kimi K2/K2.5 DeepSeek-V3 Qwen3-235B Llama 4 Maverick
Type MoE MoE MoE MoE MoE
Total params ~398B ~1T ~671B ~235B ~400B
Active params/token ~13B ~32B ~37B ~22B ~17B
Experts (routed) 256 384 256 128 128
Active per token 4 8 8 8 1
Routing fraction 1.56% 2.08% 3.13% 6.25% 0.78%
Attention Hybrid GQA (sliding+full) MLA MLA GQA GQA
Hidden dim 3,072 7,168 7,168 4,096 5,120
Layers 60 61 61 94 48
Dense layers 6 1 3 3 Unknown
Context (native) 256K (extends to 512K) 256K (K2.5) 128K 32K (base) 1M
Vocab size 200,192 163,840 129,280 151,936 202,048
Router scoring Sigmoid Sigmoid Sigmoid Softmax Softmax
Optimizer Muon + AdamW MuonClip AdamW AdamW AdamW
KV compression GQA 6:1 MLA (~28x) MLA (~28x) GQA 16:1 GQA
muP Yes No No No No
License Apache 2.0 Modified MIT Custom Apache 2.0 Llama 4

Reading the table, the key architectural tradeoff becomes clear: Trinity is the most compute-efficient per token in its weight class. At ~13B active parameters, it processes each token with roughly a third of DeepSeek-V3's compute (~37B active) and about 60% of Qwen3's (~22B active). The cost is a smaller hidden dimension (3,072 vs 7,168), which means less representational capacity per expert. Trinity compensates with expert breadth over expert depth. It is also the only model in this comparison with muP and the only one using a hybrid sliding/full attention pattern.

Deployment reality

Serving Trinity-Large-Thinking requires multi-GPU infrastructure. The full FP16 model needs substantial memory, though the ~13B active parameters per token keep inference throughput high relative to total model size.

The inference benchmarks below compare Trinity-Large against DeepSeek-V3 and GLM 4.7, all running in FP8 on 8xH200 GPUs with vLLM 0.14.0. Trinity leads on output throughput and total throughput across all input/output size configurations, and shows the lowest time-per-output-token and time-to-first-token at most sizes. The sparsity (13B active vs 37B for DeepSeek-V3) translates directly into faster per-token generation.

image

Source: Figure 4, "Arcee Trinity Large Technical Report," Singh et al., arXiv:2602.17004

The model is supported on vLLM 0.11.1+ with the following flags:

--reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser qwen3_coder

Available deployment paths:

Channel Details
Weights on HuggingFace arcee-ai/Trinity-Large-Thinking, Apache 2.0
Arcee API $0.90/M output tokens
OpenRouter Full reasoning and tool-calling support
Self-hosted (vLLM) Requires multi-GPU setup, vLLM 0.11.1+
Dell Enterprise Hub Optimized containers for Dell PowerEdge platforms

The Dell Enterprise Hub ships inference containers with vLLM configurations for Dell PowerEdge hardware, covering balanced, high-concurrency, and long-context scenarios.

Enterprise use cases at a glance

The architecture choices translate to three deployment patterns worth noting briefly:

  • Agentic customer service. Multi-turn tool-calling with auditable <think> traces. The 6:1 GQA ratio determines how many concurrent agent sessions fit per GPU node. Trinity scored 94.7% on Tau2-Bench (Telecom) and 88.0% on Tau2-Bench (Airline).

  • Software engineering agents. Post-trained on full edit-run-test loops via OpenCode. The 512K context window holds entire microservice codebases without chunking. Scored 63.2% on SWE-bench Verified.

  • Long-document reasoning. The hybrid attention pattern makes 512K-token contexts economically feasible without quadratic cost across all layers. Self-hosting keeps sensitive documents on-premise.

In each case, the Apache 2.0 license enables fine-tuning on internal data, and the <think> blocks provide audit trails for regulated environments.

Resources for further reading

Community

Sign up or log in to comment