---
license: apache-2.0
language:
- en
library_name: pytorch
pipeline_tag: text-generation
tags:
- instruct
- sft
- transformer
- PolyGLU
- activation-routing
- math
- research
- from-scratch
base_model: tylerxdurden/PolyChromaticLM-1.0-base-0.6B
model-index:
- name: PolyChromaticLM-1.0-instruct-0.6B
  results:
  - task:
      type: multiple-choice
      name: HellaSwag
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: acc_norm
      value: 27.84
      name: Normalized Accuracy
  - task:
      type: multiple-choice
      name: ARC-Easy
    dataset:
      name: ARC-Easy
      type: ai2_arc
      config: ARC-Easy
    metrics:
    - type: acc_norm
      value: 36.11
      name: Normalized Accuracy
  - task:
      type: multiple-choice
      name: ARC-Challenge
    dataset:
      name: ARC-Challenge
      type: ai2_arc
      config: ARC-Challenge
    metrics:
    - type: acc_norm
      value: 24.15
      name: Normalized Accuracy
  - task:
      type: multiple-choice
      name: PIQA
    dataset:
      name: PIQA
      type: piqa
    metrics:
    - type: acc_norm
      value: 54.52
      name: Normalized Accuracy
  - task:
      type: multiple-choice
      name: WinoGrande
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: acc
      value: 52.72
      name: Accuracy
  - task:
      type: multiple-choice
      name: BoolQ
    dataset:
      name: BoolQ
      type: boolq
    metrics:
    - type: acc
      value: 55.63
      name: Accuracy
  - task:
      type: multiple-choice
      name: SciQ
    dataset:
      name: SciQ
      type: sciq
    metrics:
    - type: acc_norm
      value: 52.70
      name: Normalized Accuracy
  - task:
      type: multiple-choice
      name: MMLU-STEM
    dataset:
      name: MMLU-STEM
      type: mmlu
      config: stem
    metrics:
    - type: acc
      value: 28.42
      name: Accuracy (5-shot)
---
<div align="center">
# PolyChromaticLM 1.0 Instruct (0.6B)
**A 597M-parameter transformer with biologically-inspired activation routing, fine-tuned for mathematical reasoning**
*SFT on ~347K math problems from Nemotron-Math-v2, with chain-of-thought solutions in ChatML format.*
[arXiv](https://arxiv.org/)
[Code](https://github.com/danielxmed/PolyGLU)
[License: Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
[Base model](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B)
</div>
---
## Overview
This is the **SFT (instruction-tuned) version** of [PolyChromaticLM-1.0-base-0.6B](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B), fine-tuned on mathematical problem-solving data with chain-of-thought reasoning in ChatML format.
The core innovation is **PolyGLU** (Polychromatic Gated Linear Unit), a drop-in SwiGLU replacement that implements **state-conditional activation routing**. Each FFN neuron dynamically selects among K=4 activation functions (ReLU, Tanh, SiLU, GELU) via a differentiable Gumbel-Softmax mechanism.
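To make the routing idea concrete, here is a minimal pure-Python sketch of Gumbel-Softmax mixing over the four activations for a single scalar pre-activation. It is an illustration only, not the implementation in the PolyGLU repo: the function names, the `router_logits` argument, and the way the logits would be derived from the hidden state are all assumptions, and the gating and projection matrices of the full GLU block are omitted.

```python
import math
import random

# The four candidate activations named in this model card.
def relu(x):
    return max(0.0, x)

def silu(x):
    return x / (1.0 + math.exp(-x))

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

ACTIVATIONS = [relu, math.tanh, silu, gelu]

def gumbel_softmax(logits, tau, rng):
    """Return relaxed one-hot weights sampled with the Gumbel-Softmax trick."""
    noisy = []
    for logit in logits:
        u = rng.uniform(1e-9, 1.0 - 1e-9)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def routed_activation(pre_activation, router_logits, tau=0.1, rng=random):
    """Mix the K=4 activations with Gumbel-Softmax weights (hypothetical helper)."""
    weights = gumbel_softmax(router_logits, tau, rng)
    return sum(w * f(pre_activation) for w, f in zip(weights, ACTIVATIONS))

# Hypothetical router logits that strongly prefer SiLU (index 2).
print(routed_activation(1.0, [0.0, 0.0, 50.0, 0.0], rng=random.Random(0)))
```

At a low temperature such as tau=0.1 the sampled weights are close to one-hot, so each call behaves almost like picking a single activation while staying differentiable in a tensor framework.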
**Author**: Daniel Nobrega (independent research)
### Key SFT Results
- **Training loss**: 1.77 → 0.91 (48.7% reduction over 1 epoch)
- **Routing entropy: 1.386 (the maximum for K=4) throughout all 13,067 SFT steps**, indicating that the PolyGLU routing was not disturbed by fine-tuning
- **MMLU-STEM improved by +3.14 pp** after SFT, with large gains on quantitative subtasks (High School Statistics +20.84 pp, College Mathematics +11.00 pp)
- Moderate forgetting on general benchmarks (mean -2.89 pp across 10 tasks); 9/10 benchmarks remain above random
---
## SFT Training
| | |
|---|---|
| **Base checkpoint** | [`PolyChromaticLM-1.0-base-0.6B`](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B) (step 19,531, 10.24B tokens) |
| **SFT dataset** | [`nvidia/Nemotron-Math-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2) (high_part00, ~347K problems) |
| **Format** | ChatML with assistant-only loss masking |
| **Epochs** | 1 |
| **Optimizer** | AdamW (beta1=0.9, beta2=0.95, eps=1e-8) |
| **Peak LR** | 2e-5 (cosine decay, 100-step warmup) |
| **Effective batch** | ~524K tokens (micro_batch=2, grad_accum=16) |
| **Gumbel-Softmax tau** | 0.1 (frozen from pre-training) |
| **Steps** | 13,067 |
| **Hardware** | 1x NVIDIA A100 80GB |
| **Duration** | ~18 hours |
| **Compute cost** | ~$29.50 |
| **Mean throughput** | ~11,447 tok/s |
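The learning-rate settings in the table can be sketched as a schedule function. This is a minimal illustration under stated assumptions: the table only gives the peak LR, the 100-step warmup, cosine decay, and the step count, so the linear warmup shape, the final LR of zero (`min_lr`), and the function name `lr_at` are hypothetical choices, not taken from the training code.

```python
import math

PEAK_LR = 2e-5        # from the table
WARMUP_STEPS = 100    # from the table
TOTAL_STEPS = 13_067  # from the table

def lr_at(step, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step (assumed linear warmup, cosine decay)."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp that reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 99, 5_000, 13_066):
    print(step, lr_at(step))
```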
### Training Dynamics
<div align="center">
<img src="figures/sft_training_dynamics.png" alt="SFT training dynamics: loss curve, learning rate, and throughput" width="90%">
</div>
<details>
<summary><b>Loss curve detail</b></summary>
<img src="figures/sft_loss_curve.png" alt="SFT loss curve from 1.77 to 0.91" width="80%">
| Step | Loss |
|-----:|-----:|
| 10 | 1.77 |
| 500 | ~1.10 |
| 5,000 | ~0.95 |
| 10,000 | ~0.90 |
| 13,067 | **0.91** |
</details>
### Routing Entropy Stability
The most notable observation: **routing entropy remained at 1.386 (≈ ln 4, the maximum entropy for K=4) throughout all 13,067 SFT steps.** This means:
- Static routing preferences learned during pre-training were NOT disturbed by SFT
- PolyGLU neurons maintained equal activation diversity across all 4 functions
- The routing architecture is **robust to fine-tuning**, a critical validation of the design
SFT modifies *what* is computed, not *how*: the routing mechanism (which activation function each neuron uses) remains unchanged, while the model's weights adapt to produce chain-of-thought reasoning.
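For reference, the 1.386 figure is the Shannon entropy (in nats) of a uniform distribution over four options. The short sketch below shows the calculation; how the reported routing entropy is aggregated over neurons and layers is not described here, so the `entropy` helper and the example distributions are illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]  # equal use of all four activations
peaked = [0.97, 0.01, 0.01, 0.01]   # hypothetical near-one-hot routing

print(round(entropy(uniform), 3))   # equals ln(4) rounded to three decimals
print(round(entropy(peaked), 3))    # far below the maximum
```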
---
## Evaluation
All benchmarks via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-eval) v0.4.11, 0-shot unless noted.
### Benchmarks (Base vs SFT vs Qwen3-0.6B-Base)
| Benchmark | Metric | Base | SFT | Delta | Random | Qwen3-0.6B |
|-----------|--------|-----:|----:|------:|-------:|-----------:|
| **HellaSwag** | acc_norm | 28.51 | 27.84 | -0.67 | 25.00 | 41.10 |
| **ARC-Easy** | acc_norm | 41.04 | 36.11 | -4.93 | 25.00 | 65.60 |
| **ARC-Challenge** | acc_norm | 22.27 | 24.15 | +1.88 | 25.00 | 33.90 |
| **PIQA** | acc_norm | 58.87 | 54.52 | -4.35 | 50.00 | 70.00 |
| **WinoGrande** | acc | 52.17 | 52.72 | +0.55 | 50.00 | 58.50 |
| **BoolQ** | acc | 61.13 | 55.63 | -5.50 | 50.00 | 69.70 |
| **MMLU-STEM** | acc (5-shot) | 25.28 | 28.42 | **+3.14** | 25.00 | n/a |
| **LAMBADA** | acc | 15.35 | 7.01 | -8.34 | ~0 | n/a |
| **OpenBookQA** | acc_norm | 29.00 | 26.80 | -2.20 | 25.00 | n/a |
| **SciQ** | acc_norm | 61.20 | 52.70 | -8.50 | 25.00 | n/a |
| **Mean** | | 39.48 | 36.59 | **-2.89** | | |
**Context**: Qwen3-0.6B-Base was trained on ~36T tokens (3,600x our budget). On the 6 tasks with published Qwen3 scores, our SFT model achieves 55-90% of Qwen3 performance (from 55% on ARC-Easy to 90% on WinoGrande). SFT narrows the gap on reasoning tasks like ARC-Challenge (71% of Qwen3, up from 66% pre-SFT).
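The Delta column, the Mean row, and the per-task ratio against Qwen3 can be recomputed directly from the table. The sketch below copies the table values into a hypothetical `SCORES` dictionary; the name and layout are illustrative and not part of the evaluation code.

```python
# (base, sft, qwen3) per benchmark, copied from the table; None = no Qwen3 score listed.
SCORES = {
    "HellaSwag": (28.51, 27.84, 41.10),
    "ARC-Easy": (41.04, 36.11, 65.60),
    "ARC-Challenge": (22.27, 24.15, 33.90),
    "PIQA": (58.87, 54.52, 70.00),
    "WinoGrande": (52.17, 52.72, 58.50),
    "BoolQ": (61.13, 55.63, 69.70),
    "MMLU-STEM": (25.28, 28.42, None),
    "LAMBADA": (15.35, 7.01, None),
    "OpenBookQA": (29.00, 26.80, None),
    "SciQ": (61.20, 52.70, None),
}

# Per-task change after SFT, in percentage points.
deltas = {name: round(sft - base, 2) for name, (base, sft, _) in SCORES.items()}

# Unweighted means over all ten tasks.
mean_base = sum(base for base, _, _ in SCORES.values()) / len(SCORES)
mean_sft = sum(sft for _, sft, _ in SCORES.values()) / len(SCORES)

# SFT score as a fraction of the Qwen3 score, where one is listed.
ratios = {name: sft / qwen for name, (_, sft, qwen) in SCORES.items() if qwen is not None}

print(deltas)
print(round(mean_base, 2), round(mean_sft, 2), round(mean_sft - mean_base, 2))
for name, ratio in ratios.items():
    print(f"{name}: {100 * ratio:.0f}% of Qwen3")
```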
<div align="center">
<img src="figures/sft_base_vs_sft_benchmarks.png" alt="Base vs SFT benchmark comparison" width="80%">
</div>
### Forgetting Analysis
<div align="center">
<img src="figures/sft_delta_chart.png" alt="Per-benchmark delta: SFT minus Base" width="80%">
</div>
**Pattern**: Tasks requiring reasoning (ARC-Challenge +1.88, MMLU-STEM +3.14) improved, while tasks measuring text fluency (LAMBADA -8.34, SciQ -8.50) regressed. Mean regression of 2.89 pp is moderate and acceptable for math-focused SFT. 9/10 benchmarks remain above random.
### GSM8K
GSM8K generation-based evaluation was not completed due to compute budget constraints. Without KV cache, autoregressive generation of 1,319 test examples required ~9+ hours of A100 GPU time. Indirect evidence of SFT effectiveness includes the converged training loss (0.91) and MMLU-STEM improvement (+3.14 pp with large gains on quantitative subtasks). See the [full evaluation report](https://github.com/danielxmed/PolyGLU/blob/main/paper_reporting/sft__performance.md) for details.
---
## Architecture
| | |
|---|---|
| **Parameters** | 597M total (~1.4M routing, 0.23% overhead) |
| **Hidden dim** | 1,024 |
| **FFN dim** | 4,096 |
| **Layers** | 28 |
| **Attention** | GQA (16 query / 8 KV heads, head dim 64) |
| **Context** | 4,096 tokens |
| **Vocab** | 151,669 ([Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B-Base) tokenizer) |
| **Position encoding** | RoPE (theta=10,000) |
| **Normalization** | RMSNorm (pre-norm) + QK-Norm |
| **FFN** | **PolyGLU** (K=4: ReLU, Tanh, SiLU, GELU) |
| **Weight tying** | Embedding <-> output head |
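As a sanity check, the 597M total can be approximated from the table alone. The sketch below is a back-of-the-envelope count under stated assumptions: the FFN is taken to have three weight matrices (gate, up, down) as in SwiGLU, biases and normalization weights are ignored, the tied embedding is counted once, and the routing parameters use the approximate 1.4M figure from the table. The variable names are illustrative.

```python
VOCAB = 151_669
HIDDEN = 1_024
FFN = 4_096
LAYERS = 28
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 64
ROUTING_PARAMS = 1_400_000  # approximate figure quoted in the table

# Tied embedding / output head: counted once.
embedding = VOCAB * HIDDEN

# GQA attention: Q and O are HIDDEN x (16 * 64); K and V are HIDDEN x (8 * 64).
attention = (
    HIDDEN * Q_HEADS * HEAD_DIM         # Q projection
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM  # K and V projections
    + Q_HEADS * HEAD_DIM * HIDDEN       # output projection
)

# Assumed GLU-style FFN with gate, up, and down matrices.
ffn = 3 * HIDDEN * FFN

total = embedding + LAYERS * (attention + ffn) + ROUTING_PARAMS
print(f"{total / 1e6:.1f}M parameters")
print(f"routing overhead: {100 * ROUTING_PARAMS / total:.2f}%")
```

Under these assumptions the estimate lands within about 0.1M of the stated total; the small remainder is consistent with the omitted RMSNorm and QK-Norm weights.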
---
## Usage
This model was trained from scratch in pure PyTorch (no HuggingFace model wrappers). To load and use it:
```python
import torch
from transformers import AutoTokenizer

# Model code lives in the training repo; run this script from the repo root:
#   git clone https://github.com/danielxmed/PolyGLU.git
from src.model.config import ModelConfig
from src.model.model import load_checkpoint

# Load model
config = ModelConfig(use_flash_attn=False)
model, step, tau = load_checkpoint("path/to/model.safetensors", config, device="cuda")
model.eval()

# Tokenize (ChatML format for instruct model)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
prompt = "<|im_start|>user\nWhat is 15% of 240?<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

# Stop on EOS or on the ChatML end-of-turn token
# (assumption: assistant turns end with <|im_end|>, as in the prompt format above).
stop_ids = {tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>")}

# Generate (greedy, no KV cache): every step re-runs the full sequence
with torch.no_grad():
    for _ in range(200):
        logits = model(input_ids)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=1)
        if next_token.item() in stop_ids:
            break

print(tokenizer.decode(input_ids[0]))
```
> **Note**: This model loads from the custom PyTorch checkpoint format. The `load_checkpoint` function in the PolyGLU repo handles both `.pt` and `.safetensors` formats. See the [GitHub repo](https://github.com/danielxmed/PolyGLU) for full details.
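For multi-turn or scripted use, the ChatML prompt string can be built with a small helper rather than by hand. The sketch below mirrors the single-turn prompt format shown in the usage example; the helper names `build_chatml_prompt` and `extract_assistant_reply` are hypothetical, and whether system messages were present in the SFT data is not stated here, so treat non-user roles as an assumption.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"

def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts as a ChatML prompt string."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

def extract_assistant_reply(decoded):
    """Return the text of the last assistant turn from a decoded sequence."""
    reply = decoded.rsplit(f"{IM_START}assistant\n", 1)[-1]
    return reply.split(IM_END, 1)[0].strip()

prompt = build_chatml_prompt([{"role": "user", "content": "What is 15% of 240?"}])
print(repr(prompt))
```

`extract_assistant_reply` can be applied to the decoded output of the generation loop to drop the prompt and any trailing end-of-turn token.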
---
## Limitations
- **No GSM8K evaluation**: generation-based evaluation was too expensive without KV cache (~9h for 1,319 examples). This is the most significant evaluation gap.
- **Math-only SFT**: fine-tuned exclusively on math problems. General instruction-following capability is limited.
- **10B token pre-training budget**: significantly less than comparable production models.
- **No KV cache**: inference requires the full training codebase; generation is slow.
- **English only**: trained exclusively on English-language data.
- **Single-epoch SFT**: additional epochs might improve performance but risk overfitting.
---
## Citation
```bibtex
@misc{nobrega2026polychromaticLM,
title = {PolychromaticLM: State-Conditional Activation Routing via Neurotransmitter-Inspired Gated Linear Units},
author = {Daniel Nobrega},
year = {2026},
url = {https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-instruct-0.6B}
}
```
---
## Links
| | |
|---|---|
| **Code** | [github.com/danielxmed/PolyGLU](https://github.com/danielxmed/PolyGLU) |
| **Base Model** | [PolyChromaticLM-1.0-base-0.6B](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B) |
| **Instruct Model** | [PolyChromaticLM-1.0-instruct-0.6B](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-instruct-0.6B) |
| **Weights & Biases** | [polychromatic-lm](https://wandb.ai/danielmedeiros-medeiros-nobrega-medtech/polychromatic-lm) |
---
<div align="center">
<i>Built from scratch on a single A100. Independent research by Daniel Nobrega.</i>
</div>