---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- odinnext
- hgrn2
- linear-attention
- recurrent
- causal-lm
- custom_code
- early-checkpoint
- fp16
- amd
- rocm
- arxiv:2404.07904
- arxiv:2605.06546
- arxiv:2407.12665
- arxiv:2506.14202
---

# OdinNext-138M-Early-Checkpoint

Early research checkpoint of **OdinNext**, a 138M-parameter causal language model using an HGRN2-style gated linear recurrence instead of softmax self-attention.

This is **not** a chat model and not a production release. It is an early pretraining checkpoint intended for architecture inspection, qualitative sampling, and continued research.

- **Repo:** `joelhenwang/OdinNext-138M-Early-Checkpoint`
- **Recommended revision:** `main` / EMA-shadowed weights
- **Training status:** early checkpoint at step 3,259
- **Context window:** 2,048 tokens in the released inference code
- **License:** Apache-2.0

> The model uses custom Transformers code. Loading it with `trust_remote_code=True` executes Python code from this repository. Only do this after reviewing the files or pinning a known commit.

## At a glance

| Item | Value |
|---|---:|
| Unique tied parameters | **138,449,696** |
| Non-embedding parameters | **113,283,872** |
| Layers | 16 |
| Hidden size | 768 |
| Heads | 6 |
| Head state dims | 128 × 128 per head |
| FFN inner size | 2,048 |
| Vocabulary | 32,768 custom BPE tokens |
| Max sequence length | 2,048 |
| Checkpoint dtype | fp16 |
| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + RMSNorm-style normalization |
| Cache type | Fixed recurrent state, not a growing Transformer KV cache |

## What this checkpoint is good for

Use this checkpoint for:

- inspecting a compact recurrent/linear-attention LM implementation;
- testing HGRN2-style recurrent decoding inside the Hugging Face `generate()` API;
- studying fixed-state decoding memory behavior;
- continuing pretraining or running controlled ablations.

Do **not** use it for:

- chat, instruction following, or agentic tasks;
- safety-sensitive output generation;
- benchmark claims without running your own evaluation;
- multilingual, coding, or long-context claims.

## Architecture

OdinNext is a decoder-only causal LM. Each block uses a pre-norm residual layout:

```text
x = x + sigmoid(gate_attn) * HGRN2(norm(x))
x = x + sigmoid(gate_ffn) * SwiGLU²(norm(x))
```

The HGRN2-style recurrent state is updated per token as:

```text
S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t
```

where each layer keeps a per-batch recurrent state shaped:

```text
[B, n_heads, head_f_dim, head_i_dim]
```

For this checkpoint:

```text
n_heads = 6
head_f_dim = 128
head_i_dim = 128
```

Even-numbered layers apply RoPE to `q` and `k`; odd-numbered layers are position-free. The current inference implementation still enforces a hard 2,048-token cumulative position limit because the RoPE cache is built for `max_seq_len = 2048`.

### Important implementation details

- The exported Hugging Face code contains only the inference path. Training-time machinery is not part of this repository.
- `past_key_values` is an `OdinNextCache`, a list of recurrent states. It is **not** a Transformer KV cache.
- `attention_mask` is accepted for API compatibility but ignored by the backbone. Left-padding is not supported.
- Batched generation is safest when all prompts have the same valid length. Padding tokens are still tokens to the recurrence if they are processed.
- `use_cache=True` is important for generation. Without it, every generation step reprocesses the full prefix.

## Parameter accounting

The 138M headline is the **unique tied-parameter runtime count**. The input embedding and LM head are tied and should be counted once for model-capacity comparisons.

Hugging Face or file-size-derived parameter summaries may round this checkpoint near 0.2B because stored checkpoint tensors and tied runtime parameters are not always counted the same way.

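
The gap between the two headline counts can be checked with plain arithmetic. The sketch below uses the vocabulary size, hidden size, and non-embedding count from the "At a glance" table. The idea that an untied summary counts the embedding matrix twice (once as input embedding, once as LM head) is an assumption made here for illustration, not something the checkpoint metadata states.

```python
# Figures from the "At a glance" table.
VOCAB_SIZE = 32_768
HIDDEN_SIZE = 768
NON_EMBEDDING_PARAMS = 113_283_872

# One embedding matrix, shared by the input embedding and the LM head.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Tied count: the shared matrix is counted once.
tied_total = NON_EMBEDDING_PARAMS + embedding_params

# Assumption: a tool that ignores weight tying counts the matrix twice.
untied_total = NON_EMBEDDING_PARAMS + 2 * embedding_params

print(f"embedding matrix : {embedding_params:,}")
print(f"tied total       : {tied_total:,}")
print(f"untied total     : {untied_total:,} (~{untied_total / 1e9:.1f}B)")
```

The tied total reproduces the 138,449,696 headline. Under the double-counting assumption the untied total is about 164M, which would round to 0.2B at one decimal place.
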
## Memory: recurrent state vs Transformer KV cache

For batch size 1 in fp16, OdinNext's recurrent state size is:

```text
layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2
= 3,145,728 bytes
≈ 3.0 MiB
```

That state is constant with respect to generated context length. It scales linearly with batch size and with dtype size. In the pure-PyTorch fallback path, the scan state is promoted to fp32, so the returned recurrent state can be about **6.0 MiB per sequence** instead of 3.0 MiB.

A same-depth 16-layer, `d_model = 768`, fp16 Transformer with full multi-head K/V cache would use approximately:

```text
layers × 2(K,V) × hidden_size × context_tokens × bytes
= 16 × 2 × 768 × T × 2
```

| Context tokens | Typical Transformer KV cache | OdinNext recurrent state |
|---:|---:|---:|
| 1,024 | 48 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 4,096 | 192 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 16,384 | 768 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 65,536 | 3,072 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |

This table is a cache-state comparison only. It is not a claim about total GPU memory, throughput, benchmark quality, or usable context length. The released OdinNext code is still limited to 2,048 cumulative positions.

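
The cache comparison above can be reproduced with a few lines of arithmetic. This is a sketch that uses only the formulas and dimensions given in this section; the function names are illustrative and are not part of the released code.

```python
MIB = 1024 * 1024


def recurrent_state_bytes(layers=16, heads=6, head_f_dim=128, head_i_dim=128,
                          bytes_per_elem=2, batch=1):
    """Fixed recurrent state size; it does not depend on context length."""
    return batch * layers * heads * head_f_dim * head_i_dim * bytes_per_elem


def kv_cache_bytes(context_tokens, layers=16, hidden_size=768,
                   bytes_per_elem=2, batch=1):
    """Full multi-head K/V cache of a same-depth Transformer."""
    return batch * layers * 2 * hidden_size * context_tokens * bytes_per_elem


state_fp16 = recurrent_state_bytes() / MIB
state_fp32 = recurrent_state_bytes(bytes_per_elem=4) / MIB

for tokens in (1_024, 4_096, 16_384, 65_536):
    kv = kv_cache_bytes(tokens) / MIB
    print(f"{tokens:>6} tokens: KV {kv:>6.0f} MiB | "
          f"state {state_fp16:.1f} MiB fp16 / {state_fp32:.1f} MiB fp32")
```

Passing `batch=8` to both functions scales both quantities by 8, so the ratio between them at a given context length is unchanged.
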
## Training snapshot

Values verified from the public config:

| Field | Value |
|---|---:|
| `_training_step` | 3,259 |
| `_total_tokens` | 6,835,666,944 |
| `_weights_source` | `ema_state_dict` |
| `torch_dtype` | `float16` |
| `max_position_embeddings` | 2,048 |

Author-reported training notes for this early checkpoint:

| Item | Value |
|---|---|
| Hardware | 2× AMD Strix Halo / gfx1151, ROCm stack |
| Training precision | fp16 + GradScaler |
| Optimizers | NorMuon for 2D tensors; AdamW for 1D/embed tensors |
| LR schedule | WSD, peak `8e-4`, warmup 500, min LR 0.1× peak |
| Stabilization | z-loss `1e-4`, attention soft-cap 50, EMA decay 0.999 |
| Curriculum | TST-style bag-size-4 phase active at this checkpoint |
| Public benchmarks | not yet provided |

### Token accounting note

The public config records `_total_tokens = 6,835,666,944`. Do not reinterpret that as plain next-token positions from:

```text
3,259 optimizer steps × 256 effective sequences × 2,048 tokens
= 1,708,654,592 position tokens
```

The 6.84B figure appears to be token-superposition/original-token-equivalent accounting rather than simple next-token position accounting. A full reproducibility report should define whether the total counts original text tokens, bagged targets, loss terms, or optimizer-position tokens.

### TST note

The cited Token-Superposition Training paper defines TST as a two-phase method: a superposition phase that combines contiguous tokens into bags and uses a multi-hot cross-entropy objective, followed by a recovery phase that returns to ordinary next-token training.

This checkpoint is described as still being in a bag-size-4 phase. That means ordinary single-stream autoregressive inference is not necessarily the final intended training distribution. Treat quality as preliminary until a bag-size-1 recovery checkpoint and benchmark results are published.

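
The two token counts in the accounting note can be compared directly. The sketch below uses only figures quoted in this section. Reading the resulting ratio as a sign of bag-size-4 accounting is an inference made here, not something the config states.

```python
# Figures quoted in this section.
OPTIMIZER_STEPS = 3_259
EFFECTIVE_SEQUENCES = 256               # effective sequences per optimizer step
SEQ_LEN = 2_048                         # tokens per sequence
REPORTED_TOTAL_TOKENS = 6_835_666_944   # `_total_tokens` in the public config

# Plain next-token positions seen by the optimizer.
position_tokens = OPTIMIZER_STEPS * EFFECTIVE_SEQUENCES * SEQ_LEN

# How many reported tokens correspond to one optimizer position.
ratio = REPORTED_TOTAL_TOKENS / position_tokens

print(f"position tokens : {position_tokens:,}")
print(f"reported total  : {REPORTED_TOTAL_TOKENS:,}")
print(f"ratio           : {ratio:.3f}")
```

The ratio comes out close to 4, which is consistent with the bag-size-4 phase described above but does not confirm how `_total_tokens` was actually accumulated.
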
## Usage with Transformers

Install the basics:

```bash
pip install "transformers>=4.46" torch safetensors
```

Optional: install `flash-linear-attention` if your platform supports it. Without it, the model falls back to a pure-PyTorch reference implementation that is useful for correctness and portability but slower for long prompts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Early-Checkpoint"

# For reproducible experiments, replace "main" with a specific commit hash.
revision = "main"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tok = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    revision=revision,
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device).eval()

prompt = "The night was quiet and the streets were empty"
inputs = tok(prompt, return_tensors="pt").to(device)

# The released code is capped at 2,048 cumulative positions.
remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
max_new_tokens = max(0, min(80, remaining))

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id,
        use_cache=True,
    )

print(tok.decode(out[0], skip_special_tokens=True))
```

### Batching guidance

The model's recurrent scan does not apply an attention mask. For correct batched generation:

- avoid left padding;
- prefer same-length prompts in a batch;
- avoid processing pad tokens as if they were real prompt tokens;
- test batched output against single-sample output before relying on batched generation.

Single-prompt generation is the safest path for basic use.

## Known limitations

- **No instruction tuning:** no SFT, DPO, RLHF, RLAIF, or chat template.
- **No safety training:** outputs can be unsafe, biased, false, or incoherent.
- **Early quality:** this is about 3% of the planned pretraining budget according to the original release notes.
- **No formal benchmarks yet:** HellaSwag, ARC, MMLU, perplexity suites, and long-context tests are not provided here.
- **Hard 2,048-token cap:** recurrent cache size is constant, but the released RoPE cache still limits positions.
- **Masking caveat:** `attention_mask` is ignored in the backbone; padding can affect recurrent state.
- **English-focused:** multilingual and code generation should be assumed weak unless tested.
- **bf16 unvalidated:** fp16 is the intended inference dtype for this checkpoint; CPU fallback should use fp32 for portability.
- **Training data not fully documented in this card:** treat data provenance, memorization risk, and bias profile as uncharacterized unless separately documented.

## Revisions

- `main`: EMA-shadowed weights from `_weights_source = ema_state_dict`; recommended for evaluation.
- `live`: raw training weights at step 3,259, if this branch is retained.

For reproducible experiments, pin a commit hash rather than a moving branch name.

## Citation

```bibtex
@misc{odinnext_138m_early_2026,
  title = {OdinNext-138M-Early-Checkpoint},
  author = {Wang, Joel},
  year = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Early-Checkpoint}},
  note = {Early HGRN2 recurrent language-model checkpoint}
}
```

## References

- Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong. **HGRN2: Gated Linear RNNs with State Expansion.** arXiv:2404.07904. https://arxiv.org/abs/2404.07904
- Bowen Peng, Théo Gigant, Jeffrey Quesnelle. **Efficient Pre-Training with Token Superposition.** arXiv:2605.06546. https://arxiv.org/abs/2605.06546
- Chenze Shao, Fandong Meng, Jie Zhou. **Patch-Level Training for Large Language Models.** arXiv:2407.12665.
  https://arxiv.org/abs/2407.12665
- Makoto Shing, Masanori Koyama, Takuya Akiba. **DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation.** arXiv:2506.14202. https://arxiv.org/abs/2506.14202
- Hugging Face Transformers custom-model documentation: https://huggingface.co/docs/transformers/custom_models
- vLLM custom/Transformers backend documentation: https://docs.vllm.ai/en/latest/models/supported_models/
- SGLang Transformers backend documentation: https://huggingface.co/docs/transformers/en/community_integrations/sglang