---
license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - odinnext
  - hgrn2
  - linear-attention
  - recurrent
  - causal-lm
  - custom_code
  - early-checkpoint
  - fp16
  - amd
  - rocm
  - arxiv:2404.07904
  - arxiv:2605.06546
  - arxiv:2407.12665
  - arxiv:2506.14202
---

# OdinNext-138M-Early-Checkpoint

Early research checkpoint of **OdinNext**, a 138M-parameter causal language model using an HGRN2-style gated linear recurrence instead of softmax self-attention.

This is **not** a chat model and not a production release. It is an early pretraining checkpoint intended for architecture inspection, qualitative sampling, and continued research.

- **Repo:** `joelhenwang/OdinNext-138M-Early-Checkpoint`
- **Recommended revision:** `main` / EMA-shadowed weights
- **Training status:** early checkpoint at step 3,259
- **Context window:** 2,048 tokens in the released inference code
- **License:** Apache-2.0

> The model uses custom Transformers code. Loading it with `trust_remote_code=True` executes Python code from this repository. Only do this after reviewing the files or pinning a known commit.

## At a glance

| Item | Value |
|---|---:|
| Unique tied parameters | **138,449,696** |
| Non-embedding parameters | **113,283,872** |
| Layers | 16 |
| Hidden size | 768 |
| Heads | 6 |
| Head state dims | 128 × 128 per head |
| FFN inner size | 2,048 |
| Vocabulary | 32,768 custom BPE tokens |
| Max sequence length | 2,048 |
| Checkpoint dtype | fp16 |
| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + RMSNorm-style normalization |
| Cache type | Fixed recurrent state, not a growing Transformer KV cache |

## What this checkpoint is good for

Use this checkpoint for:

- inspecting a compact recurrent/linear-attention LM implementation;
- testing HGRN2-style recurrent decoding inside the Hugging Face `generate()` API;
- studying fixed-state decoding memory behavior;
- continuing pretraining or running controlled ablations.

Do **not** use it for:

- chat, instruction following, or agentic tasks;
- safety-sensitive output generation;
- benchmark claims without running your own evaluation;
- multilingual, coding, or long-context claims.

## Architecture

OdinNext is a decoder-only causal LM. Each block uses a pre-norm residual layout:

```text
x = x + sigmoid(gate_attn) * HGRN2(norm(x))
x = x + sigmoid(gate_ffn)  * SwiGLU²(norm(x))
```

The HGRN2-style recurrent state is updated per token as:

```text
S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t
```

where each layer keeps a per-batch recurrent state shaped:

```text
[B, n_heads, head_f_dim, head_i_dim]
```

For this checkpoint:

```text
n_heads    = 6
head_f_dim = 128
head_i_dim = 128
```
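
The update rule above can be sketched for a single head in dependency-free Python. This is a toy for reading the formula, not the repository's kernel. `hgrn2_step` and `run_sequence` are hypothetical names, the tiny dimensions used in practice here stand in for 128 × 128, and treating `g` as a log-decay (so that `exp(g)` lies in (0, 1]) is an assumption about the parameterization.

```python
import math


def hgrn2_step(state, q, k, v, g):
    """Apply one recurrent step for a single head.

    state: list of rows, shape [f_dim][i_dim]
    q, k, g: length f_dim; v: length i_dim
    g is assumed to hold log-decay values, so exp(g) is the per-row decay.
    """
    f_dim, i_dim = len(k), len(v)
    # S_t = diag(exp(g_t)) S_{t-1} + k_t (outer product) v_t
    new_state = [
        [math.exp(g[i]) * state[i][j] + k[i] * v[j] for j in range(i_dim)]
        for i in range(f_dim)
    ]
    # o_t = q_t S_t, a vector of length i_dim
    out = [sum(q[i] * new_state[i][j] for i in range(f_dim)) for j in range(i_dim)]
    return new_state, out


def run_sequence(qs, ks, vs, gs, f_dim, i_dim):
    """Scan a token sequence, starting from an all-zero state."""
    state = [[0.0] * i_dim for _ in range(f_dim)]
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gs):
        state, out = hgrn2_step(state, q, k, v, g)
        outputs.append(out)
    return state, outputs
```

However many tokens are scanned, `state` keeps its `[f_dim][i_dim]` shape, which is the property the memory comparison later in this card relies on.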

Even-numbered layers apply RoPE to `q` and `k`; odd-numbered layers are position-free. The current inference implementation still enforces a hard 2,048-token cumulative position limit because the RoPE cache is built for `max_seq_len = 2048`.

### Important implementation details

- The exported Hugging Face code contains only the inference path. Training-time machinery is not part of this repository.
- `past_key_values` is an `OdinNextCache`, a list of recurrent states. It is **not** a Transformer KV cache.
- `attention_mask` is accepted for API compatibility but ignored by the backbone. Left-padding is not supported.
- Batched generation is safest when all prompts have the same valid length. Any pad token that passes through the model updates the recurrent state exactly like a real token.
- `use_cache=True` is important for generation. Without it, every generation step reprocesses the full prefix.
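
To illustrate the `use_cache` point, the following back-of-the-envelope sketch counts how many token positions pass through the model during generation. `positions_processed` is a hypothetical helper, and it assumes one forward pass per generated token with the first pass covering the whole prompt. It ignores kernel-level details and is not a throughput measurement.

```python
def positions_processed(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count token positions fed through the model while generating."""
    if new_tokens <= 0:
        return 0
    if use_cache:
        # Prefill the prompt once, then feed only the newest token per step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, each step re-reads the prompt plus all generated tokens.
    return sum(prompt_len + i for i in range(new_tokens))


if __name__ == "__main__":
    for cached in (True, False):
        print(cached, positions_processed(prompt_len=10, new_tokens=80, use_cache=cached))
```

Under these assumptions the cached path grows linearly with the number of new tokens, while the uncached path grows quadratically.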

## Parameter accounting

The 138M headline is the **unique tied-parameter runtime count**. The input embedding and LM head are tied and should be counted once for model-capacity comparisons.


Hugging Face or file-size-derived parameter summaries may round this checkpoint near 0.2B because stored checkpoint tensors and tied runtime parameters are not always counted the same way.
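
The figures in the "At a glance" table can be cross-checked with a few lines of arithmetic. The sketch assumes the tied embedding is a single vocabulary × hidden-size matrix. The `double_counted` value is a hypothetical tally showing one way a summary could land near 0.2B; it is not a statement about how any particular tool counts.

```python
VOCAB_SIZE = 32_768
HIDDEN_SIZE = 768
UNIQUE_PARAMS = 138_449_696  # tied embedding / LM head counted once

# Assumption: the tied embedding is one vocab x hidden matrix.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
non_embedding_params = UNIQUE_PARAMS - embedding_params

# Hypothetical tally that counts the tied matrix twice (input + output).
double_counted = UNIQUE_PARAMS + embedding_params

print(f"embedding:      {embedding_params:,}")      # 25,165,824
print(f"non-embedding:  {non_embedding_params:,}")  # 113,283,872
print(f"double-counted: {double_counted:,}")        # 163,615,520
```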

## Memory: recurrent state vs Transformer KV cache

For batch size 1 in fp16, OdinNext's recurrent state size is:

```text
layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2
= 3,145,728 bytes ≈ 3.0 MiB
```

That state is constant with respect to generated context length. It scales linearly with batch size and with dtype size. In the pure-PyTorch fallback path, the scan state is promoted to fp32, so the returned recurrent state can be about **6.0 MiB per sequence** instead of 3.0 MiB.

A same-depth 16-layer, `d_model = 768`, fp16 Transformer with full multi-head K/V cache would use approximately:

```text
layers × 2(K,V) × hidden_size × context_tokens × bytes
= 16 × 2 × 768 × T × 2
```

| Context tokens | Transformer KV cache (fp16, full MHA) | OdinNext recurrent state |
|---:|---:|---:|
| 1,024 | 48 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 4,096 | 192 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 16,384 | 768 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 65,536 | 3,072 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |

This table is a cache-state comparison only. It is not a claim about total GPU memory, throughput, benchmark quality, or usable context length. The released OdinNext code is still limited to 2,048 cumulative positions.
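
The two formulas in this section can be reproduced with a short script. The function names are hypothetical; the defaults are the values stated in this card, and the Transformer baseline is the same idealized full multi-head K/V layout used for the table.

```python
MIB = 1024 ** 2


def recurrent_state_bytes(layers=16, heads=6, f_dim=128, i_dim=128,
                          bytes_per_el=2, batch=1):
    """Size of the fixed recurrent state; independent of context length."""
    return batch * layers * heads * f_dim * i_dim * bytes_per_el


def kv_cache_bytes(context_tokens, layers=16, hidden=768,
                   bytes_per_el=2, batch=1):
    """Size of a full multi-head K/V cache; grows linearly with context."""
    return batch * layers * 2 * hidden * context_tokens * bytes_per_el


if __name__ == "__main__":
    state_fp16 = recurrent_state_bytes() / MIB
    state_fp32 = recurrent_state_bytes(bytes_per_el=4) / MIB
    for tokens in (1_024, 4_096, 16_384, 65_536):
        kv = kv_cache_bytes(tokens) / MIB
        print(f"{tokens:>6} tokens: KV {kv:7.0f} MiB | "
              f"state {state_fp16:.1f} MiB fp16 / {state_fp32:.1f} MiB fp32")
```

Passing `batch` or `bytes_per_el` shows the linear scaling with batch size and dtype described above.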

## Training snapshot

Values verified from the public config:

| Field | Value |
|---|---:|
| `_training_step` | 3,259 |
| `_total_tokens` | 6,835,666,944 |
| `_weights_source` | `ema_state_dict` |
| `torch_dtype` | `float16` |
| `max_position_embeddings` | 2,048 |

Author-reported training notes for this early checkpoint:

| Item | Value |
|---|---|
| Hardware | 2× AMD Strix Halo / gfx1151, ROCm stack |
| Training precision | fp16 + GradScaler |
| Optimizers | NorMuon for 2D tensors; AdamW for 1D/embed tensors |
| LR schedule | WSD, peak `8e-4`, warmup 500, min LR 0.1× peak |
| Stabilization | z-loss `1e-4`, attention soft-cap 50, EMA decay 0.999 |
| Curriculum | TST-style bag-size-4 phase active at this checkpoint |
| Public benchmarks | not yet provided |

### Token accounting note

The public config records `_total_tokens = 6,835,666,944`. That figure is not the plain next-token position count, which would be:

```text
3,259 optimizer steps × 256 effective sequences × 2,048 tokens
= 1,708,654,592 position tokens
```

The 6.84B figure is almost exactly 4× that product, which is consistent with token-superposition/original-token-equivalent accounting under bag size 4 rather than simple next-token position accounting. A full reproducibility report should define whether the total counts original text tokens, bagged targets, loss terms, or optimizer-position tokens.
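
The comparison can be rerun directly from the numbers quoted in this card. This is only an arithmetic check: a ratio close to the bag size of 4 is suggestive, but it does not confirm how `_total_tokens` was actually accumulated.

```python
steps = 3_259
effective_sequences = 256
seq_len = 2_048
total_tokens = 6_835_666_944  # `_total_tokens` from the public config

# Plain next-token positions implied by the step count.
position_tokens = steps * effective_sequences * seq_len

# How many recorded tokens there are per plain position.
ratio = total_tokens / position_tokens

print(f"position tokens: {position_tokens:,}")
print(f"recorded / position ratio: {ratio:.4f}")
```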

### TST note

The cited Token-Superposition Training paper defines TST as a two-phase method: a superposition phase that combines contiguous tokens into bags and uses a multi-hot cross-entropy objective, followed by a recovery phase that returns to ordinary next-token training.

This checkpoint is described as still being in a bag-size-4 phase. That means ordinary single-stream autoregressive inference is not necessarily the final intended training distribution. Treat quality as preliminary until a bag-size-1 recovery checkpoint and benchmark results are published.

## Usage with Transformers

Install the basics:

```bash
pip install "transformers>=4.46" torch safetensors
```

Optional: install `flash-linear-attention` if your platform supports it. Without it, the model falls back to a pure-PyTorch reference implementation that is useful for correctness and portability but slower for long prompts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Early-Checkpoint"
# For reproducible experiments, replace "main" with a specific commit hash.
revision = "main"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tok = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    revision=revision,
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device).eval()

prompt = "The night was quiet and the streets were empty"
inputs = tok(prompt, return_tensors="pt").to(device)

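# The released code is capped at 2,048 cumulative positions (prompt + generated).
remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
if remaining <= 0:
    raise ValueError("Prompt already fills the position limit; shorten it.")
max_new_tokens = min(80, remaining)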

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id,
        use_cache=True,
    )

print(tok.decode(out[0], skip_special_tokens=True))
```

### Batching guidance

The model's recurrent scan does not apply an attention mask. For correct batched generation:

- avoid left padding;
- prefer same-length prompts in a batch;
- avoid processing pad tokens as if they were real prompt tokens;
- test batched output against single-sample output before relying on batched generation.

Single-prompt generation is the safest path for basic use.
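
One way to follow the same-length guidance without padding is to bucket prompts by token count and run each bucket as its own batch. The helper below is a minimal sketch: `group_by_length` is a hypothetical name, and it expects plain lists of token ids, such as those returned when the tokenizer is called without padding.

```python
from collections import defaultdict


def group_by_length(token_id_lists):
    """Map each prompt length to the indices of the prompts with that length."""
    buckets = defaultdict(list)
    for idx, ids in enumerate(token_id_lists):
        buckets[len(ids)].append(idx)
    return dict(buckets)


if __name__ == "__main__":
    # Toy token ids standing in for tokenizer output.
    prompts = [[11, 12], [21], [31, 32], [41, 42, 43]]
    for length, indices in sorted(group_by_length(prompts).items()):
        print(f"length {length}: prompts {indices}")
```

Each bucket can then be stacked into a tensor with no pad tokens, so nothing but real prompt tokens reaches the recurrence.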


## Known limitations

- **No instruction tuning:** no SFT, DPO, RLHF, RLAIF, or chat template.
- **No safety training:** outputs can be unsafe, biased, false, or incoherent.
- **Early quality:** this is about 3% of the planned pretraining budget according to the original release notes.
- **No formal benchmarks yet:** HellaSwag, ARC, MMLU, perplexity suites, and long-context tests are not provided here.
- **Hard 2,048-token cap:** recurrent cache size is constant, but the released RoPE cache still limits positions.
- **Masking caveat:** `attention_mask` is ignored in the backbone; padding can affect recurrent state.
- **English-focused:** multilingual and code generation should be assumed weak unless tested.
- **bf16 unvalidated:** fp16 is the intended inference dtype for this checkpoint; CPU fallback should use fp32 for portability.
- **Training data not fully documented in this card:** treat data provenance, memorization risk, and bias profile as uncharacterized unless separately documented.

## Revisions

- `main`: EMA-shadowed weights from `_weights_source = ema_state_dict`; recommended for evaluation.
- `live`: raw training weights at step 3,259, if this branch is retained.

For reproducible experiments, pin a commit hash rather than a moving branch name.

## Citation

```bibtex
@misc{odinnext_138m_early_2026,
  title        = {OdinNext-138M-Early-Checkpoint},
  author       = {Wang, Joel},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Early-Checkpoint}},
  note         = {Early HGRN2 recurrent language-model checkpoint}
}
```

## References

- Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong. **HGRN2: Gated Linear RNNs with State Expansion.** arXiv:2404.07904. https://arxiv.org/abs/2404.07904
- Bowen Peng, Théo Gigant, Jeffrey Quesnelle. **Efficient Pre-Training with Token Superposition.** arXiv:2605.06546. https://arxiv.org/abs/2605.06546
- Chenze Shao, Fandong Meng, Jie Zhou. **Patch-Level Training for Large Language Models.** arXiv:2407.12665. https://arxiv.org/abs/2407.12665
- Makoto Shing, Masanori Koyama, Takuya Akiba. **DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation.** arXiv:2506.14202. https://arxiv.org/abs/2506.14202
- Hugging Face Transformers custom-model documentation: https://huggingface.co/docs/transformers/custom_models
- vLLM custom/Transformers backend documentation: https://docs.vllm.ai/en/latest/models/supported_models/
- SGLang Transformers backend documentation: https://huggingface.co/docs/transformers/en/community_integrations/sglang