---
license: apache-2.0
language:
  - en
library_name: pytorch
pipeline_tag: text-generation
tags:
  - instruct
  - sft
  - transformer
  - PolyGLU
  - activation-routing
  - math
  - research
  - from-scratch
base_model: tylerxdurden/PolyChromaticLM-1.0-base-0.6B
model-index:
  - name: PolyChromaticLM-1.0-instruct-0.6B
    results:
      - task:
          type: multiple-choice
          name: HellaSwag
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - type: acc_norm
            value: 27.84
            name: Normalized Accuracy
      - task:
          type: multiple-choice
          name: ARC-Easy
        dataset:
          name: ARC-Easy
          type: ai2_arc
          config: ARC-Easy
        metrics:
          - type: acc_norm
            value: 36.11
            name: Normalized Accuracy
      - task:
          type: multiple-choice
          name: ARC-Challenge
        dataset:
          name: ARC-Challenge
          type: ai2_arc
          config: ARC-Challenge
        metrics:
          - type: acc_norm
            value: 24.15
            name: Normalized Accuracy
      - task:
          type: multiple-choice
          name: PIQA
        dataset:
          name: PIQA
          type: piqa
        metrics:
          - type: acc_norm
            value: 54.52
            name: Normalized Accuracy
      - task:
          type: multiple-choice
          name: WinoGrande
        dataset:
          name: WinoGrande
          type: winogrande
        metrics:
          - type: acc
            value: 52.72
            name: Accuracy
      - task:
          type: multiple-choice
          name: BoolQ
        dataset:
          name: BoolQ
          type: boolq
        metrics:
          - type: acc
            value: 55.63
            name: Accuracy
      - task:
          type: multiple-choice
          name: SciQ
        dataset:
          name: SciQ
          type: sciq
        metrics:
          - type: acc_norm
            value: 52.70
            name: Normalized Accuracy
      - task:
          type: multiple-choice
          name: MMLU-STEM
        dataset:
          name: MMLU-STEM
          type: mmlu
          config: stem
        metrics:
          - type: acc
            value: 28.42
            name: Accuracy (5-shot)
---

<div align="center">

# PolyChromaticLM 1.0 Instruct (0.6B)

**A 597M-parameter transformer with biologically inspired activation routing, fine-tuned for mathematical reasoning**

*SFT on ~347K math problems from Nemotron-Math-v2, with chain-of-thought solutions in ChatML format.*

[![Paper](https://img.shields.io/badge/arXiv-2026-b31b1b.svg)](https://arxiv.org/)
[![Code](https://img.shields.io/badge/GitHub-PolyGLU-blue.svg)](https://github.com/danielxmed/PolyGLU)
[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![Base Model](https://img.shields.io/badge/Base-PolyChromaticLM--1.0--base--0.6B-orange.svg)](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B)

</div>

---

## Overview

This is the **SFT (instruction-tuned) version** of [PolyChromaticLM-1.0-base-0.6B](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B), fine-tuned on mathematical problem-solving data with chain-of-thought reasoning in ChatML format.

The core innovation is **PolyGLU** (Polychromatic Gated Linear Unit), a drop-in SwiGLU replacement that implements **state-conditional activation routing**. Each FFN neuron dynamically selects among K=4 activation functions (ReLU, Tanh, SiLU, GELU) via a differentiable Gumbel-Softmax mechanism.
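
To make the routing idea concrete, here is a minimal PyTorch sketch of a gated FFN whose gate activation is a per-neuron Gumbel-Softmax mixture of the four functions named above. It is an illustration only, not the implementation from the PolyGLU repository: the class name `PolyGLUSketch`, the `static_logits` parameter, and the `state_router` projection are hypothetical, and the real parameterization of the router may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolyGLUSketch(nn.Module):
    """Illustrative gated FFN with per-neuron activation routing (K=4)."""

    def __init__(self, d_model: int, d_ff: int, tau: float = 0.1):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)
        # Hypothetical routing parameters (assumption, not from the repo):
        # static per-neuron logits plus an offset computed from the hidden state.
        self.static_logits = nn.Parameter(torch.zeros(d_ff, 4))
        self.state_router = nn.Linear(d_model, 4, bias=False)
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.gate_proj(x)  # [batch, seq, d_ff]
        # Candidate activations stacked on a new last axis: [batch, seq, d_ff, 4]
        acts = torch.stack(
            [F.relu(g), torch.tanh(g), F.silu(g), F.gelu(g)], dim=-1
        )
        # Broadcast the state-conditional offset over all d_ff neurons.
        logits = self.static_logits + self.state_router(x).unsqueeze(-2)
        if self.training:
            # Differentiable, noisy selection during training.
            weights = F.gumbel_softmax(logits, tau=self.tau, hard=False, dim=-1)
        else:
            # Deterministic temperature-scaled softmax at inference.
            weights = F.softmax(logits / self.tau, dim=-1)
        mixed = (weights * acts).sum(dim=-1)  # [batch, seq, d_ff]
        return self.down_proj(mixed * self.up_proj(x))
```

With a low temperature such as `tau=0.1`, the mixture weights approach a one-hot choice whenever the logits differ noticeably, so each neuron effectively commits to one activation function.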

**Author**: Daniel Nobrega (independent research)

### Key SFT Results

- **Training loss**: 1.77 → 0.91 (48.7% reduction over 1 epoch)
- **Routing entropy: 1.386 (maximum) throughout all 13,067 SFT steps**: the PolyGLU routing architecture is fully robust to fine-tuning
- **MMLU-STEM improved by +3.14 pp** after SFT, with large gains on quantitative subtasks (High School Statistics +20.84 pp, College Mathematics +11.00 pp)
- Moderate forgetting on general benchmarks (mean -2.89 pp across 10 tasks); 9/10 benchmarks remain above random

---

## SFT Training

| | |
|---|---|
| **Base checkpoint** | [`PolyChromaticLM-1.0-base-0.6B`](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B) (step 19,531, 10.24B tokens) |
| **SFT dataset** | [`nvidia/Nemotron-Math-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2) (high_part00, ~347K problems) |
| **Format** | ChatML with assistant-only loss masking |
| **Epochs** | 1 |
| **Optimizer** | AdamW (beta1=0.9, beta2=0.95, eps=1e-8) |
| **Peak LR** | 2e-5 (cosine decay, 100-step warmup) |
| **Effective batch** | ~524K tokens (micro_batch=2, grad_accum=16) |
| **Gumbel-Softmax tau** | 0.1 (frozen from pre-training) |
| **Steps** | 13,067 |
| **Hardware** | 1x NVIDIA A100 80GB |
| **Duration** | ~18 hours |
| **Compute cost** | ~$29.50 |
| **Mean throughput** | ~11,447 tok/s |
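
The "assistant-only loss masking" entry in the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's data pipeline: the helper `mask_non_assistant` and the span format are hypothetical, the token IDs are toy values, and the next-token shift is assumed to happen later in the training loop. The value `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(token_ids, assistant_spans):
    """Copy token IDs inside assistant spans; ignore everything else.

    assistant_spans is a list of (start, end) index pairs, end exclusive.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels


# Toy sequence: positions 0-3 are the user turn, positions 4-7 the assistant turn.
token_ids = [11, 12, 13, 14, 21, 22, 23, 24]
labels = mask_non_assistant(token_ids, [(4, 8)])
print(labels)  # [-100, -100, -100, -100, 21, 22, 23, 24]
```

Only the assistant positions contribute to the loss, so the model is trained to produce solutions rather than to reproduce the problem statements.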

### Training Dynamics

<div align="center">
<img src="figures/sft_training_dynamics.png" alt="SFT training dynamics: loss curve, learning rate, and throughput" width="90%">
</div>

<details>
<summary><b>Loss curve detail</b></summary>
<img src="figures/sft_loss_curve.png" alt="SFT loss curve from 1.77 to 0.91" width="80%">

| Step | Loss |
|-----:|-----:|
| 10 | 1.77 |
| 500 | ~1.10 |
| 5,000 | ~0.95 |
| 10,000 | ~0.90 |
| 13,067 | **0.91** |

</details>
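
For reference, the learning-rate schedule listed in the training table (peak 2e-5, 100-step warmup, cosine decay over 13,067 steps) can be sketched as below. The linear warmup shape and the final learning rate of 0 are assumptions; the model card does not state them, and the function name `lr_at` is hypothetical.

```python
import math

PEAK_LR = 2e-5
WARMUP_STEPS = 100
TOTAL_STEPS = 13_067
MIN_LR = 0.0  # assumption: the card does not state the final learning rate


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, 99, 5_000, 13_067):
    print(step, f"{lr_at(step):.2e}")
```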

### Routing Entropy Stability

The most notable observation: **routing entropy remained at 1.386 (≈ ln 4, the maximum entropy for K=4) throughout all 13,067 SFT steps.** This means:

- Static routing preferences learned during pre-training were NOT disturbed by SFT
- PolyGLU neurons maintained equal activation diversity across all 4 functions
- The routing architecture is **robust to fine-tuning**, a critical validation of the design

SFT modifies *what* is computed, not *how*: the routing mechanism (which activation function each neuron uses) remains unchanged, while the model's weights adapt to produce chain-of-thought reasoning.
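
The 1.386 figure is the Shannon entropy (in nats) of a uniform distribution over four options. The short check below illustrates this; how the repository aggregates the routing distribution before computing entropy is not specified here, so the probability vectors are illustrative values only.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # all four activations used equally
peaked = [0.97, 0.01, 0.01, 0.01]    # one activation dominates

print(round(entropy(uniform), 3))          # 1.386
print(entropy(peaked) < entropy(uniform))  # True
```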

---

## Evaluation

All benchmarks via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-eval) v0.4.11, 0-shot unless noted.

### Benchmarks (Base vs SFT vs Qwen3-0.6B-Base)

| Benchmark | Metric | Base | SFT | Delta | Random | Qwen3-0.6B |
|-----------|--------|-----:|----:|------:|-------:|-----------:|
| **HellaSwag** | acc_norm | 28.51 | 27.84 | -0.67 | 25.00 | 41.10 |
| **ARC-Easy** | acc_norm | 41.04 | 36.11 | -4.93 | 25.00 | 65.60 |
| **ARC-Challenge** | acc_norm | 22.27 | 24.15 | +1.88 | 25.00 | 33.90 |
| **PIQA** | acc_norm | 58.87 | 54.52 | -4.35 | 50.00 | 70.00 |
| **WinoGrande** | acc | 52.17 | 52.72 | +0.55 | 50.00 | 58.50 |
| **BoolQ** | acc | 61.13 | 55.63 | -5.50 | 50.00 | 69.70 |
| **MMLU-STEM** | acc (5-shot) | 25.28 | 28.42 | **+3.14** | 25.00 | n/a |
| **LAMBADA** | acc | 15.35 | 7.01 | -8.34 | ~0 | n/a |
| **OpenBookQA** | acc_norm | 29.00 | 26.80 | -2.20 | 25.00 | n/a |
| **SciQ** | acc_norm | 61.20 | 52.70 | -8.50 | 25.00 | n/a |
| **Mean** | | 39.48 | 36.59 | **-2.89** | | |

**Context**: Qwen3-0.6B-Base was trained on ~36T tokens (roughly 3,500x our 10.24B-token pre-training budget). On the 6 tasks with published Qwen3 scores, our SFT model reaches 55-90% of Qwen3 performance. SFT narrows the gap on reasoning tasks like ARC-Challenge (71% of Qwen3, up from 66% pre-SFT).
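
The relative-performance figures can be recomputed directly from the table. The snippet below is a small convenience sketch that uses only the SFT and Qwen3-0.6B columns above; the variable names are arbitrary.

```python
# benchmark: (SFT score, Qwen3-0.6B-Base score), copied from the table above
scores = {
    "HellaSwag": (27.84, 41.10),
    "ARC-Easy": (36.11, 65.60),
    "ARC-Challenge": (24.15, 33.90),
    "PIQA": (54.52, 70.00),
    "WinoGrande": (52.72, 58.50),
    "BoolQ": (55.63, 69.70),
}

# SFT score expressed as a percentage of the Qwen3 score
ratios = {name: 100 * sft / qwen for name, (sft, qwen) in scores.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:14s} {ratio:5.1f}%")
```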

<div align="center">
<img src="figures/sft_base_vs_sft_benchmarks.png" alt="Base vs SFT benchmark comparison" width="80%">
</div>

### Forgetting Analysis

<div align="center">
<img src="figures/sft_delta_chart.png" alt="Per-benchmark delta: SFT minus Base" width="80%">
</div>

**Pattern**: Tasks requiring reasoning (ARC-Challenge +1.88, MMLU-STEM +3.14) improved, while tasks that lean on language modeling and factual recall (LAMBADA -8.34, SciQ -8.50) regressed the most. Mean regression of 2.89 pp is moderate and acceptable for math-focused SFT. 9/10 benchmarks remain above random; the exception is ARC-Challenge (24.15 vs. 25.00).
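
The summary statistics in this section follow from the benchmark table. A minimal sketch that recomputes the mean delta and the above-random count, treating LAMBADA's "~0" random baseline as 0.0 (an assumption made for the comparison):

```python
# benchmark: (base, SFT, random baseline), copied from the benchmark table
results = {
    "HellaSwag": (28.51, 27.84, 25.00),
    "ARC-Easy": (41.04, 36.11, 25.00),
    "ARC-Challenge": (22.27, 24.15, 25.00),
    "PIQA": (58.87, 54.52, 50.00),
    "WinoGrande": (52.17, 52.72, 50.00),
    "BoolQ": (61.13, 55.63, 50.00),
    "MMLU-STEM": (25.28, 28.42, 25.00),
    "LAMBADA": (15.35, 7.01, 0.0),  # "~0" in the table, taken as 0.0 here
    "OpenBookQA": (29.00, 26.80, 25.00),
    "SciQ": (61.20, 52.70, 25.00),
}

mean_delta = sum(sft - base for base, sft, _ in results.values()) / len(results)
above_random = [name for name, (_, sft, rnd) in results.items() if sft > rnd]

print(f"mean delta: {mean_delta:+.2f} pp")
print(f"above random: {len(above_random)}/{len(results)}")
```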

### GSM8K

GSM8K generation-based evaluation was not completed due to compute budget constraints. Without a KV cache, autoregressive generation for the 1,319 test examples requires 9 or more hours of A100 GPU time. Indirect evidence of SFT effectiveness includes the converged training loss (0.91) and the MMLU-STEM improvement (+3.14 pp, with large gains on quantitative subtasks). See the [full evaluation report](https://github.com/danielxmed/PolyGLU/blob/main/paper_reporting/sft__performance.md) for details.
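
A rough way to see why the missing KV cache makes generation expensive is to count how many token positions pass through the model during decoding. The sketch below is a back-of-the-envelope count, not a timing model: the prompt and output lengths are illustrative values, and it ignores that attention cost also grows with context length even when a cache is used.

```python
def positions_processed(prompt_len: int, new_tokens: int, kv_cache: bool) -> int:
    """Count token positions fed through the model while generating new_tokens."""
    if kv_cache:
        # The prompt is processed once; each later step feeds one new token.
        return prompt_len + new_tokens - 1
    # Without a cache, step i re-runs the whole prefix of length prompt_len + i.
    return sum(prompt_len + i for i in range(new_tokens))


# Illustrative lengths (assumptions, not measured GSM8K statistics)
without_cache = positions_processed(prompt_len=100, new_tokens=200, kv_cache=False)
with_cache = positions_processed(prompt_len=100, new_tokens=200, kv_cache=True)

print(without_cache)  # 39900
print(with_cache)     # 299
```

The uncached count grows quadratically with the number of generated tokens, which is why long chain-of-thought outputs are especially costly here.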

---

## Architecture

| | |
|---|---|
| **Parameters** | 597M total (~1.4M routing, 0.23% overhead) |
| **Hidden dim** | 1,024 |
| **FFN dim** | 4,096 |
| **Layers** | 28 |
| **Attention** | GQA (16 query / 8 KV heads, head dim 64) |
| **Context** | 4,096 tokens |
| **Vocab** | 151,669 ([Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B-Base) tokenizer) |
| **Position encoding** | RoPE (theta=10,000) |
| **Normalization** | RMSNorm (pre-norm) + QK-Norm |
| **FFN** | **PolyGLU** (K=4: ReLU, Tanh, SiLU, GELU) |
| **Weight tying** | Embedding <-> output head |
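
As a sanity check, the 597M total can be approximately reproduced from the table. The arithmetic below assumes bias-free linear layers and three FFN matrices per layer (gate, up, down, as in SwiGLU), ignores the small normalization parameters, and takes the ~1.4M routing parameters from the table as given; these are assumptions, not details confirmed by the model card.

```python
vocab, d_model, d_ff, n_layers = 151_669, 1_024, 4_096, 28
n_q_heads, n_kv_heads, head_dim = 16, 8, 64

# Tied input embedding / output head is counted once.
embedding = vocab * d_model

# Attention: Q and O use all query heads; K and V use the smaller KV head count (GQA).
q_and_o = 2 * d_model * (n_q_heads * head_dim)
k_and_v = 2 * d_model * (n_kv_heads * head_dim)

# FFN: gate, up, and down projections (assumed SwiGLU-style layout).
ffn = 3 * d_model * d_ff

routing = 1_400_000  # approximate figure from the table

total = embedding + n_layers * (q_and_o + k_and_v + ffn) + routing
print(f"{total / 1e6:.1f}M")                      # 597.1M
print(f"routing share: {100 * routing / total:.2f}%")  # routing share: 0.23%
```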

---

## Usage

This model was trained from scratch in pure PyTorch (no HuggingFace model wrappers). To load and use it:

```python
import torch
from transformers import AutoTokenizer

# Clone the training repo for model code
# git clone https://github.com/danielxmed/PolyGLU.git
from src.model.config import ModelConfig
from src.model.model import load_checkpoint

# Load model
config = ModelConfig(use_flash_attn=False)
model, step, tau = load_checkpoint("path/to/model.safetensors", config, device="cuda")
model.eval()

# Tokenize (ChatML format for instruct model)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
prompt = "<|im_start|>user\nWhat is 15% of 240?<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

# Generate (greedy, no KV cache); stop at the ChatML end-of-turn token or EOS
stop_ids = {tokenizer.convert_tokens_to_ids("<|im_end|>"), tokenizer.eos_token_id}
with torch.no_grad():
    for _ in range(200):
        logits = model(input_ids)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=1)
        if next_token.item() in stop_ids:
            break

print(tokenizer.decode(input_ids[0]))
```

> **Note**: This model loads from the custom PyTorch checkpoint format. The `load_checkpoint` function in the PolyGLU repo handles both `.pt` and `.safetensors` formats. See the [GitHub repo](https://github.com/danielxmed/PolyGLU) for full details.
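
For multi-turn or programmatic use, the ChatML prompt string can be built with a small helper. This is a convenience sketch that mirrors the prompt format in the usage example; the function name `format_chatml` is hypothetical, and whether the SFT data included a system turn is not stated here.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([{"role": "user", "content": "What is 15% of 240?"}])
print(prompt)
```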

---

## Limitations

- **No GSM8K evaluation**: generation-based evaluation was too expensive without KV cache (~9h for 1,319 examples). This is the most significant evaluation gap.
- **Math-only SFT**: fine-tuned exclusively on math problems. General instruction-following capability is limited.
- **10B token pre-training budget**: significantly less than comparable production models.
- **No KV cache**: inference requires the full training codebase; generation is slow.
- **English only**: trained exclusively on English-language data.
- **Single-epoch SFT**: additional epochs might improve performance but risk overfitting.

---

## Citation

```bibtex
@misc{nobrega2026polychromaticLM,
  title   = {PolychromaticLM: State-Conditional Activation Routing via Neurotransmitter-Inspired Gated Linear Units},
  author  = {Daniel Nobrega},
  year    = {2026},
  url     = {https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-instruct-0.6B}
}
```

---

## Links

| | |
|---|---|
| **Code** | [github.com/danielxmed/PolyGLU](https://github.com/danielxmed/PolyGLU) |
| **Base Model** | [PolyChromaticLM-1.0-base-0.6B](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B) |
| **Instruct Model** | [PolyChromaticLM-1.0-instruct-0.6B](https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-instruct-0.6B) |
| **Weights & Biases** | [polychromatic-lm](https://wandb.ai/danielmedeiros-medeiros-nobrega-medtech/polychromatic-lm) |

---

<div align="center">
<i>Built from scratch on a single A100. Independent research by Daniel Nobrega.</i>
</div>