joelhenwang committed 11 days ago

Commit 51b0052 (verified) · 1 Parent(s): 244ade3

Update README.md

Files changed (1)

README.md +239 -90

README.md CHANGED

@@ -1,142 +1,291 @@
 ---
 license: apache-2.0
 library_name: transformers
 pipeline_tag: text-generation
-language:
-- en
 tags:
-- odinnext
-- hgrn2
-- linear-attention
-- recurrent
-- custom_code
-- early-checkpoint
-- causal-lm
-- amd
-- rocm
 ---
 # OdinNext-138M-Early-Checkpoint
-Early-stage checkpoint of **OdinNext**, a 138M-parameter HGRN2 linear-attention LM trained from scratch on AMD Strix Halo (gfx1151). **6.84B tokens, ~3% of the planned pretraining budget.** Still in active development — output quality is weak, no SFT, no alignment, no context extension.
-> **Variant**: `main (EMA)` — EMA-shadowed weights (decay 0.999), recommended for evaluation. See the [`live` revision](https://huggingface.co/joelhenwang/OdinNext-138M-Early-Checkpoint/tree/live) for the raw training weights.
 ## At a glance
-| | |
-|---|---|
-| Params | 138.4M (113.3M non-embedding) |
-| Architecture | 16 layers, HGRN2 + alternating RoPE, SwiGLU², ZCRMSNorm |
-| Hidden / Heads / FFN | 768 / 6 / 2048 |
-| Vocab / Context | 32,768 (custom BPE) / 2,048 |
-| Inference | **O(1) per token, fixed 3 MB recurrent state** (no growing KV cache) |
-| Training | fp16 + GradScaler, NorMuon (2D) + AdamW (1D/embed), WSD, EMA 0.999 |
-| Curriculum | TST bag-size-4 active throughout this checkpoint |
-| Hardware | 2× AMD Strix Halo (gfx1151), ROCm 7.13, gloo over Thunderbolt 4 |
 ## Architecture
-OdinNext replaces softmax attention with the **HGRN2 gated linear recurrence** [1]:
-`S_t = diag(exp(g_t)) · S_{t-1} + k_t ⊗ v_t`,  `o_t = q_t · S_t`. The state is a fixed-size matrix updated in place, so per-token decode is O(1) in compute and memory regardless of context length.
-Sixteen identical pre-norm blocks: `x + σ(gate_attn) · HGRN2(ZCRMSNorm(x))`, `x + σ(gate_ffn) · SwiGLU²(ZCRMSNorm(x))`. Tied embeddings + LM head. No biases on linear layers.
-**Hybrid RoPE**: even layers apply RoPE on q/k (θ=100,000); odd layers are position-free. Half the depth thus generalizes to arbitrary length without ABF, simplifying future context extension.
-### Decisions and why
-Choices below come from 25+ ablations on a 100M proxy model; only the BPB-winning configuration shipped.
-- **Linear attention (HGRN2) over softmax**: gfx1151 has no MFMA tensor cores. Custom HIP attention can't beat rocBLAS, and softmax attention's O(T²) memory dominates step time on this platform. HGRN2 is dominated by element-wise ops + small GEMMs that fit rocBLAS-friendly shapes.
-- **fp16, not bf16**: bf16 GEMMs are 24% slower on gfx1151 and trigger Inductor crashes under `torch.compile`. fp16 + GradScaler + z-loss + activation soft-cap is stable.
-- **SwiGLU² over SwiGLU**: −0.009 BPB at iso-parameter count. The squared SiLU gate gives sharper sparsity with smooth gradients.
-- **ZCRMSNorm + zero-init gates**: block at init is approximately identity (γ=0, σ(0)=0.5). Loss starts at ≈ln(V), no spike-and-recover phase. Required for future block-wise denoising training [3].
-- **NorMuon (2D, fp16 NS) + AdamW (1D, embed @ 0.3× LR)**: each parameter group gets the right update rule for its geometry; Newton-Schulz in fp16 is ~10× faster than fp32 on this platform with no measurable quality loss.
-- **TST bag-size-4 curriculum** [2]: every position averages 4 stochastic subword tokenizations of the same text. Forces tokenization-invariant representations early. **Note**: this checkpoint is fully pre-transition (still bagged) → single-stream inference is slightly OOD. Quality is expected to lift after the planned bag-size→1 transition.
-## Training
-| | |
-|---|---|
-| Batch | 32 seqs × 4 grad-accum × 2 ranks = 256 effective sequences (524,288 tokens/step) |
-| Optimizer steps | 3,259 |
-| LR schedule | WSD, peak 8e-4 (NorMuon), warmup 500, MIN_LR 0.1× |
-| Stability | z-loss 1e-4, attn-softcap 50, EMA decay 0.999, GradScaler growth 500 |
-| Compile | `max-autotune-no-cudagraphs`, per-layer (`compile_zones`) |
-| Throughput | ~427K tokens/s aggregate across 2 nodes |
-| Run health | 0 NaN events; GradScaler scale 1024→65,536 cleanly |
-| Final loss / BPB (step 3,200) | 1.886 / 0.755 |
-## Memory: HGRN2 state vs Transformer KV cache
-| Context | Transformer KV (typical d=768) | OdinNext HGRN2 |
 |---:|---:|---:|
-| 1K | ~24 MB | **~3 MB** |
-| 4K | ~96 MB | **~3 MB** |
-| 16K | ~384 MB | **~3 MB** |
-| 64K | ~1.5 GB | **~3 MB** |
-State size: `n_layers × n_heads × head_f_dim × head_i_dim × 2 bytes` ≈ 3 MB, **constant**.
-## Usage
 ```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
-name = "joelhenwang/OdinNext-138M-Early-Checkpoint"
-tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(
-    name, trust_remote_code=True, torch_dtype=torch.float16
-).to("cuda" if torch.cuda.is_available() else "cpu").eval()
-inputs = tok("The night was quiet and the streets were empty", return_tensors="pt").to(model.device)
 with torch.inference_mode():
     out = model.generate(
-        **inputs, max_new_tokens=80, do_sample=True,
-        temperature=0.8, top_p=0.95, repetition_penalty=1.1,
-        pad_token_id=tok.pad_token_id, use_cache=True,
     )
 print(tok.decode(out[0], skip_special_tokens=True))
 ```
-- `use_cache=True` is essential — without it, the model re-processes the full prefix each step.
-- `past_key_values` is **not** a KV cache; it's a fixed-size HGRN2 state (`OdinNextCache`).
-- Hard cap at 2,048 cumulative positions. Recurrence is causal-only — for batched generation, **right-pad**.
-- [`flash-linear-attention`](https://github.com/sustcsonglin/flash-linear-attention) is recommended (~10–30× faster Triton kernels). The model auto-falls-back to a pure-PyTorch reference if `fla` is unavailable.
-## Caveats
-- ❌ No SFT, no DPO/RLHF, no chat template, no safety training.
-- ❌ No context extension (max 2,048 tokens).
-- ❌ English-only mixture; multilingual and code outputs will be poor.
-- ❌ TST bagging still active → expect a quality jump at the planned bag→1 transition.
-- ❌ bf16 inference untested on this checkpoint.
-- ❌ Formal benchmarks (HellaSwag, ARC, etc.) pending.
-## Revisions
-- **`main`** — EMA-shadowed weights (decay 0.999). Recommended for evaluation.
-- **`live`** — raw training weights at step 3,259.
-## License
-Apache-2.0.
 ## Citation
 ```bibtex
-@misc{{odinnext_138m_early_2026,
-  title  = {{OdinNext-138M-Early-Checkpoint}},
-  author = {{Wang, Joel}},
-  year   = {{2026}},
-  url    = {{https://huggingface.co/joelhenwang/OdinNext-138M-Early-Checkpoint}},
-}}
 ```
 ## References
-[1] Qin, Yang, Sun, et al. **HGRN2: Gated Linear RNNs with State Expansion.** arXiv:2404.07904, 2024.
-[2] **Token Superposition Training (TST).** arXiv:2605.06546. (Related: PatchTrain, Shao et al., arXiv:2407.12665, ICLR 2025.)
-[3] **DiffusionBlocks** — block-wise training via score-matching denoising (Iizuka et al., 2025). Used in the planned post-this-checkpoint phase, not in this run.

 ---
 license: apache-2.0
+language:
+  - en
 library_name: transformers
 pipeline_tag: text-generation
 tags:
+  - odinnext
+  - hgrn2
+  - linear-attention
+  - recurrent
+  - causal-lm
+  - custom_code
+  - early-checkpoint
+  - fp16
+  - amd
+  - rocm
+  - arxiv:2404.07904
+  - arxiv:2605.06546
+  - arxiv:2407.12665
+  - arxiv:2506.14202
 ---
 # OdinNext-138M-Early-Checkpoint
+Early research checkpoint of **OdinNext**, a 138M-parameter causal language model using an HGRN2-style gated linear recurrence instead of softmax self-attention.
+This is **not** a chat model and not a production release. It is an early pretraining checkpoint intended for architecture inspection, qualitative sampling, and continued research.
+- **Repo:** `joelhenwang/OdinNext-138M-Early-Checkpoint`
+- **Recommended revision:** `main` / EMA-shadowed weights
+- **Training status:** early checkpoint at step 3,259
+- **Context window:** 2,048 tokens in the released inference code
+- **License:** Apache-2.0
+> The model uses custom Transformers code. Loading it with `trust_remote_code=True` executes Python code from this repository. Only do this after reviewing the files or pinning a known commit.
 ## At a glance
+| Item | Value |
+|---|---:|
+| Unique tied parameters | **138,449,696** |
+| Non-embedding parameters | **113,283,872** |
+| Layers | 16 |
+| Hidden size | 768 |
+| Heads | 6 |
+| Head state dims | 128 × 128 per head |
+| FFN inner size | 2,048 |
+| Vocabulary | 32,768 custom BPE tokens |
+| Max sequence length | 2,048 |
+| Checkpoint dtype | fp16 |
+| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + RMSNorm-style normalization |
+| Cache type | Fixed recurrent state, not a growing Transformer KV cache |
+## What this checkpoint is good for
+Use this checkpoint for:
+- inspecting a compact recurrent/linear-attention LM implementation;
+- testing HGRN2-style recurrent decoding inside the Hugging Face `generate()` API;
+- studying fixed-state decoding memory behavior;
+- continuing pretraining or running controlled ablations.
+Do **not** use it for:
+- chat, instruction following, or agentic tasks;
+- safety-sensitive output generation;
+- benchmark claims without running your own evaluation;
+- multilingual, coding, or long-context claims.
 ## Architecture
+OdinNext is a decoder-only causal LM. Each block uses a pre-norm residual layout:
+```text
+x = x + sigmoid(gate_attn) * HGRN2(norm(x))
+x = x + sigmoid(gate_ffn)  * SwiGLU²(norm(x))
+```
+The HGRN2-style recurrent state is updated per token as:
+```text
+S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
+o_t = q_t S_t
+```
+where each layer keeps a per-batch recurrent state shaped:
+```text
+[B, n_heads, head_f_dim, head_i_dim]
+```
+For this checkpoint:
+```text
+n_heads    = 6
+head_f_dim = 128
+head_i_dim = 128
+```
+Even-numbered layers apply RoPE to `q` and `k`; odd-numbered layers are position-free. The current inference implementation still enforces a hard 2,048-token cumulative position limit because the RoPE cache is built for `max_seq_len = 2048`.
+### Important implementation details
+- The exported Hugging Face code contains only the inference path. Training-time machinery is not part of this repository.
+- `past_key_values` is an `OdinNextCache`, a list of recurrent states. It is **not** a Transformer KV cache.
+- `attention_mask` is accepted for API compatibility but ignored by the backbone. Left-padding is not supported.
+- Batched generation is safest when all prompts have the same valid length. Padding tokens are still tokens to the recurrence if they are processed.
+- `use_cache=True` is important for generation. Without it, every generation step reprocesses the full prefix.
+## Parameter accounting
+The 138M headline is the **unique tied-parameter runtime count**. The input embedding and LM head are tied and should be counted once for model-capacity comparisons.
+Hugging Face or file-size-derived parameter summaries may round this checkpoint near 0.2B because stored checkpoint tensors and tied runtime parameters are not always counted the same way.
+## Memory: recurrent state vs Transformer KV cache
+For batch size 1 in fp16, OdinNext's recurrent state size is:
+```text
+layers × heads × head_f_dim × head_i_dim × bytes
+= 16 × 6 × 128 × 128 × 2
+= 3,145,728 bytes ≈ 3.0 MiB
+```
+That state is constant with respect to generated context length. It scales linearly with batch size and with dtype size. In the pure-PyTorch fallback path, the scan state is promoted to fp32, so the returned recurrent state can be about **6.0 MiB per sequence** instead of 3.0 MiB.
+A same-depth 16-layer, `d_model = 768`, fp16 Transformer with full multi-head K/V cache would use approximately:
+```text
+layers × 2(K,V) × hidden_size × context_tokens × bytes
+= 16 × 2 × 768 × T × 2
+```
+| Context tokens | Typical Transformer KV cache | OdinNext recurrent state |
 |---:|---:|---:|
+| 1,024 | 48 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
+| 4,096 | 192 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
+| 16,384 | 768 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
+| 65,536 | 3,072 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
+This table is a cache-state comparison only. It is not a claim about total GPU memory, throughput, benchmark quality, or usable context length. The released OdinNext code is still limited to 2,048 cumulative positions.
+## Training snapshot
+Values verified from the public config:
+| Field | Value |
+|---|---:|
+| `_training_step` | 3,259 |
+| `_total_tokens` | 6,835,666,944 |
+| `_weights_source` | `ema_state_dict` |
+| `torch_dtype` | `float16` |
+| `max_position_embeddings` | 2,048 |
+Author-reported training notes for this early checkpoint:
+| Item | Value |
+|---|---|
+| Hardware | 2× AMD Strix Halo / gfx1151, ROCm stack |
+| Training precision | fp16 + GradScaler |
+| Optimizers | NorMuon for 2D tensors; AdamW for 1D/embed tensors |
+| LR schedule | WSD, peak `8e-4`, warmup 500, min LR 0.1× peak |
+| Stabilization | z-loss `1e-4`, attention soft-cap 50, EMA decay 0.999 |
+| Curriculum | TST-style bag-size-4 phase active at this checkpoint |
+| Public benchmarks | not yet provided |
+### Token accounting note
+The public config records `_total_tokens = 6,835,666,944`. Do not reinterpret that as plain next-token positions from:
+```text
+3,259 optimizer steps × 256 effective sequences × 2,048 tokens
+= 1,708,654,592 position tokens
+```
+The 6.84B figure appears to be token-superposition/original-token-equivalent accounting rather than simple next-token position accounting. A full reproducibility report should define whether the total counts original text tokens, bagged targets, loss terms, or optimizer-position tokens.
+### TST note
+The cited Token-Superposition Training paper defines TST as a two-phase method: a superposition phase that combines contiguous tokens into bags and uses a multi-hot cross-entropy objective, followed by a recovery phase that returns to ordinary next-token training.
+This checkpoint is described as still being in a bag-size-4 phase. That means ordinary single-stream autoregressive inference is not necessarily the final intended training distribution. Treat quality as preliminary until a bag-size-1 recovery checkpoint and benchmark results are published.
+## Usage with Transformers
+Install the basics:
+```bash
+pip install "transformers>=4.46" torch safetensors
+```
+Optional: install `flash-linear-attention` if your platform supports it. Without it, the model falls back to a pure-PyTorch reference implementation that is useful for correctness and portability but slower for long prompts.
 ```python
 import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+repo = "joelhenwang/OdinNext-138M-Early-Checkpoint"
+# For reproducible experiments, replace "main" with a specific commit hash.
+revision = "main"
+device = "cuda" if torch.cuda.is_available() else "cpu"
+dtype = torch.float16 if device == "cuda" else torch.float32
+tok = AutoTokenizer.from_pretrained(repo, revision=revision)
 model = AutoModelForCausalLM.from_pretrained(
+    repo,
+    revision=revision,
+    trust_remote_code=True,
+    torch_dtype=dtype,
+).to(device).eval()
+prompt = "The night was quiet and the streets were empty"
+inputs = tok(prompt, return_tensors="pt").to(device)
+# The released code is capped at 2,048 cumulative positions.
+remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
+max_new_tokens = max(0, min(80, remaining))
 with torch.inference_mode():
     out = model.generate(
+        **inputs,
+        max_new_tokens=max_new_tokens,
+        do_sample=True,
+        temperature=0.8,
+        top_p=0.95,
+        repetition_penalty=1.1,
+        pad_token_id=tok.pad_token_id,
+        use_cache=True,
     )
 print(tok.decode(out[0], skip_special_tokens=True))
 ```
+### Batching guidance
+The model's recurrent scan does not apply an attention mask. For correct batched generation:
+- avoid left padding;
+- prefer same-length prompts in a batch;
+- avoid processing pad tokens as if they were real prompt tokens;
+- test batched output against single-sample output before relying on batched generation.
+Single-prompt generation is the safest path for basic use.
+## Known limitations
+- **No instruction tuning:** no SFT, DPO, RLHF, RLAIF, or chat template.
+- **No safety training:** outputs can be unsafe, biased, false, or incoherent.
+- **Early quality:** this is about 3% of the planned pretraining budget according to the original release notes.
+- **No formal benchmarks yet:** HellaSwag, ARC, MMLU, perplexity suites, and long-context tests are not provided here.
+- **Hard 2,048-token cap:** recurrent cache size is constant, but the released RoPE cache still limits positions.
+- **Masking caveat:** `attention_mask` is ignored in the backbone; padding can affect recurrent state.
+- **English-focused:** multilingual and code generation should be assumed weak unless tested.
+- **bf16 unvalidated:** fp16 is the intended inference dtype for this checkpoint; CPU fallback should use fp32 for portability.
+- **Training data not fully documented in this card:** treat data provenance, memorization risk, and bias profile as uncharacterized unless separately documented.
+## Revisions
+- `main`: EMA-shadowed weights from `_weights_source = ema_state_dict`; recommended for evaluation.
+- `live`: raw training weights at step 3,259, if this branch is retained.
+For reproducible experiments, pin a commit hash rather than a moving branch name.
 ## Citation
 ```bibtex
+@misc{odinnext_138m_early_2026,
+  title        = {OdinNext-138M-Early-Checkpoint},
+  author       = {Wang, Joel},
+  year         = {2026},
+  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Early-Checkpoint}},
+  note         = {Early HGRN2 recurrent language-model checkpoint}
+}
 ```
 ## References
+- Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong. **HGRN2: Gated Linear RNNs with State Expansion.** arXiv:2404.07904. https://arxiv.org/abs/2404.07904
+- Bowen Peng, Théo Gigant, Jeffrey Quesnelle. **Efficient Pre-Training with Token Superposition.** arXiv:2605.06546. https://arxiv.org/abs/2605.06546
+- Chenze Shao, Fandong Meng, Jie Zhou. **Patch-Level Training for Large Language Models.** arXiv:2407.12665. https://arxiv.org/abs/2407.12665
+- Makoto Shing, Masanori Koyama, Takuya Akiba. **DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation.** arXiv:2506.14202. https://arxiv.org/abs/2506.14202
+- Hugging Face Transformers custom-model documentation: https://huggingface.co/docs/transformers/custom_models
+- vLLM custom/Transformers backend documentation: https://docs.vllm.ai/en/latest/models/supported_models/
+- SGLang Transformers backend documentation: https://huggingface.co/docs/transformers/en/community_integrations/sglang
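
The memory and token-accounting formulas in the updated README can be checked numerically. The following is a minimal sketch and is not part of the repository: the function names and default arguments are hypothetical, and they only restate figures quoted in the README (16 layers, 6 heads, 128 × 128 head state, `d_model = 768`, fp16, 3,259 steps × 256 sequences × 2,048 tokens). It assumes batch size 1 unless told otherwise and says nothing about total GPU memory or throughput.

```python
MIB = 1024 * 1024


def recurrent_state_bytes(layers=16, heads=6, head_f_dim=128, head_i_dim=128,
                          bytes_per_elem=2, batch=1):
    """Size of the fixed recurrent state: one [heads, f, i] matrix per layer."""
    return batch * layers * heads * head_f_dim * head_i_dim * bytes_per_elem


def kv_cache_bytes(context_tokens, layers=16, hidden_size=768,
                   bytes_per_elem=2, batch=1):
    """Size of a full multi-head K/V cache for a same-depth Transformer."""
    return batch * layers * 2 * hidden_size * context_tokens * bytes_per_elem


def position_tokens(steps=3259, sequences_per_step=256, seq_len=2048):
    """Plain next-token positions seen after `steps` optimizer steps."""
    return steps * sequences_per_step * seq_len


def main():
    state_fp16 = recurrent_state_bytes() / MIB
    state_fp32 = recurrent_state_bytes(bytes_per_elem=4) / MIB
    for tokens in (1024, 4096, 16384, 65536):
        kv = kv_cache_bytes(tokens) / MIB
        print(f"{tokens:>6} tokens: KV {kv:7.0f} MiB | "
              f"state {state_fp16:.1f} MiB fp16 / {state_fp32:.1f} MiB fp32")
    # Compare the positional count with the `_total_tokens` value in the config.
    positions = position_tokens()
    print(f"position tokens: {positions:,}")
    print(f"_total_tokens / position tokens: {6_835_666_944 / positions:.3f}")


if __name__ == "__main__":
    main()
```

Evaluating the product directly is a useful check on any hand-computed figure. The ratio printed on the last line is close to, but not exactly, the bag size of 4, which is consistent with the README's caution that `_total_tokens` should not be read as a simple positional count.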