---
language: en
license: apache-2.0
base_model: allenai/OLMoE-1B-7B-0924-Instruct
tags:
  - olmoe
  - moe
  - mixture-of-experts
  - compressed
  - hxq
  - helix-substrate
  - vector-quantization
  - helixcode
library_name: transformers
pipeline_tag: text-generation
model-index:
  - name: olmoe-1b-7b-instruct-helix
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - type: acc_norm
            value: 0.7876
            name: Accuracy (norm)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ARC-Easy
          type: ai2_arc
          config: ARC-Easy
        metrics:
          - type: acc_norm
            value: 0.7685
            name: Accuracy (norm)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ARC-Challenge
          type: ai2_arc
          config: ARC-Challenge
        metrics:
          - type: acc_norm
            value: 0.5205
            name: Accuracy (norm)
---

# OLMoE-1B-7B-Instruct-HXQ

> **1.9x smaller than BF16. HellaSwag 78.8%. First MoE compressed with HXQ.**
>
> OLMoE-1B-7B-Instruct (64-expert Mixture-of-Experts, 1B active / 6.9B total parameters) compressed from 13 GB (BF16) to 6.7 GB. All three downstream benchmarks are within noise of the dense (uncompressed BF16) baseline. No calibration data. No architecture-specific tuning. Just `pip install` and `from_pretrained()`.

## Install and Run

```bash
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EchoLabs33/olmoe-1b-7b-instruct-helix",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/olmoe-1b-7b-instruct-helix")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That's it. `import helix_substrate` registers the quantizer. `from_pretrained()` handles the rest automatically.

## Downstream Benchmarks

Evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11 on an NVIDIA RTX 3090 (batch=4, dtype=bfloat16):

| Benchmark | Dense (acc_norm) | HXQ (acc_norm) | Delta (pp) |
|-----------|-----------------|----------------|------------|
| **HellaSwag** | 78.92% | **78.76%** | **-0.16** |
| **ARC-Challenge** | 52.13% | **52.05%** | **-0.08** |
| **ARC-Easy** | 75.72% | **76.85%** | **+1.14** |

All deltas are within standard error, so task performance is preserved after 1.9x compression. "Dense" here means the uncompressed BF16 model. These are downstream scores from paired dense/HXQ evaluations, not perplexity (PPL) proxies.
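
One way to sanity-check the deltas is to compare each against the binomial standard error of the difference between the two accuracies. The sketch below does that using only the scores from the table. The example counts are an assumption (the sizes of the splits lm-evaluation-harness scores by default for these tasks) and are not stated in this card. Deltas recomputed from the rounded table values can differ from the table in the last digit.

```python
import math

# Scores copied from the table above (acc_norm as fractions): (dense, hxq).
SCORES = {
    "HellaSwag": (0.7892, 0.7876),
    "ARC-Challenge": (0.5213, 0.5205),
    "ARC-Easy": (0.7572, 0.7685),
}

# ASSUMPTION: example counts for the default evaluation splits
# (HellaSwag validation, ARC test). Not stated in this card.
N_EXAMPLES = {"HellaSwag": 10042, "ARC-Challenge": 1172, "ARC-Easy": 2376}


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n examples."""
    return math.sqrt(p * (1.0 - p) / n)


def compare(task: str) -> tuple[float, float]:
    """Return (delta, standard error of delta) in percentage points."""
    dense, hxq = SCORES[task]
    n = N_EXAMPLES[task]
    # Treats the two runs as independent, which is usually conservative
    # for a paired evaluation on identical examples.
    se_delta = math.sqrt(binomial_se(dense, n) ** 2 + binomial_se(hxq, n) ** 2)
    return 100.0 * (hxq - dense), 100.0 * se_delta


for task in SCORES:
    delta, se = compare(task)
    print(f"{task:14s} delta={delta:+.2f} pp  se(delta)={se:.2f} pp")
```

This is an independent cross-check, not the procedure lm-evaluation-harness uses to report its own `stderr` values.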

## Compression Benchmark

| | Dense (BF16) | HXQ |
|---|---|---|
| **Size** | 13 GB | **6.7 GB** |
| **Compression ratio** | — | **1.9x** |
| **VRAM (eval)** | 13,886 MB | **7,540 MB** |
| **Compressed modules** | — | 3,152 HelixLinear layers |
| **Architecture** | OLMoE (64-expert MoE) | unchanged |
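
The ratios follow directly from the table. A minimal sketch of the arithmetic, using only the figures above:

```python
def ratio(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    return before / after


# Checkpoint size in GB, from the table.
disk_ratio = ratio(13.0, 6.7)

# Peak VRAM during evaluation in MB, from the table.
vram_ratio = ratio(13886, 7540)
vram_saved_mb = 13886 - 7540

print(f"disk: {disk_ratio:.2f}x  vram: {vram_ratio:.2f}x  saved: {vram_saved_mb} MB")
```

The VRAM ratio comes out slightly lower than the on-disk ratio, which is what you would expect when activations and other runtime buffers are not compressed.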

## Verification Status

- **Compression receipt:** PASS — 3,152 compressed, 67 exact, 12,675 total keys
- **Conversion receipt:** PASS — SHA256 `a9f74982b746853077d13dc11c8bc863dc91219c81e22577de1de2b195c7b836`
- **Downstream eval:** PASS — paired dense/HXQ on HellaSwag, ARC-Easy, ARC-Challenge
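
To compare a local file against the SHA256 in the conversion receipt, a streaming hash is enough. This is a generic sketch: the card does not say which artifact the hash covers, so the file name in the usage comment is hypothetical.

```python
import hashlib

# Digest copied from the conversion receipt above.
RECEIPT_SHA256 = "a9f74982b746853077d13dc11c8bc863dc91219c81e22577de1de2b195c7b836"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large checkpoints never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_receipt(path: str, expected: str = RECEIPT_SHA256) -> bool:
    """True if the file's SHA256 equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


# Usage (hypothetical file name, assumed for illustration):
# matches_receipt("model.safetensors")
```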

## Good to Know

- **GPU and CPU supported** — runs on any CUDA GPU or CPU via standard PyTorch.
- **`trust_remote_code=True` required** — OLMoE uses custom modeling code.
- **Fine-tunable via LoRA** — compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- **Requires `helix-substrate`** — the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- **64 experts = slow eval** — lm-eval-harness takes ~5.5 hours on a 3090 due to MoE routing overhead. Inference speed is normal for interactive use.
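
For intuition on the LoRA point, the sketch below shows the general idea of a trainable low-rank adapter on top of a frozen weight matrix. It is dependency-free, generic LoRA and is not the `HelixLinearSTE` implementation; the class name and defaults are made up for illustration.

```python
import random


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class FrozenLinearWithLoRA:
    """y = W x + (alpha / rank) * B (A x), where W is never updated."""

    def __init__(self, frozen_weight, rank=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        out_features = len(frozen_weight)
        in_features = len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # stands in for the compressed layer
        self.scale = alpha / rank
        # Standard LoRA init: A is small random, B is zero, so the adapter
        # starts out as an exact no-op.
        self.lora_a = [[rng.gauss(0.0, 0.01) for _ in range(in_features)]
                       for _ in range(rank)]
        self.lora_b = [[0.0] * rank for _ in range(out_features)]

    def forward(self, x):
        base = matvec(self.frozen_weight, x)
        update = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, update)]


weight = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
layer = FrozenLinearWithLoRA(weight, rank=2)
x = [1.0, 2.0, 3.0]
print(layer.forward(x))  # equals the frozen layer's output until B is trained
```

Only `lora_a` and `lora_b` would receive gradients during fine-tuning, which is why the compressed weights can stay frozen.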

## What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a **256-entry codebook** (float32) + **uint8 index matrix** + optional **sidecar corrections** for outlier values
- The compressed form *is* the executable — `HelixLinear` performs `codebook[indices] @ x` directly, no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MoE, CNN)
- **No calibration data required** — unlike GPTQ/AWQ, codebooks are fit from the weights alone
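
The bullets above can be illustrated with a small, dependency-free sketch. It assumes one uint8 index per weight and fits the codebook from evenly spaced quantiles of the weights, which is a stand-in for however `helix-substrate` actually fits its codebooks. All function names are hypothetical, and Python floats stand in for float32.

```python
import bisect
import random


def fit_codebook(values, k=256):
    """Choose k sorted codebook entries at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]


def nearest_index(codebook, w):
    """Index of the entry closest to w (codebook must be sorted)."""
    pos = bisect.bisect_left(codebook, w)
    if pos == 0:
        return 0
    if pos == len(codebook):
        return len(codebook) - 1
    return pos if codebook[pos] - w < w - codebook[pos - 1] else pos - 1


def compress(weight, k=256, n_outliers=8):
    """Return (codebook, uint8 index rows, sidecar). Requires k <= 256."""
    codebook = fit_codebook([w for row in weight for w in row], k)
    indices = [bytes(nearest_index(codebook, w) for w in row) for row in weight]
    # Sidecar: keep exact values for the worst-reconstructed positions.
    errors = sorted(
        ((abs(codebook[indices[r][c]] - w), r, c)
         for r, row in enumerate(weight) for c, w in enumerate(row)),
        reverse=True,
    )
    sidecar = {(r, c): weight[r][c] for _, r, c in errors[:n_outliers]}
    return codebook, indices, sidecar


def compressed_matvec(codebook, indices, sidecar, x):
    """Compute W @ x by codebook lookup; the dense W is never rebuilt."""
    out = []
    for r, row in enumerate(indices):
        acc = 0.0
        for c, idx in enumerate(row):
            acc += sidecar.get((r, c), codebook[idx]) * x[c]
        out.append(acc)
    return out


rng = random.Random(0)
weight = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(32)]
x = [rng.gauss(0.0, 1.0) for _ in range(64)]

codebook, indices, sidecar = compress(weight)
dense_y = [sum(w * v for w, v in zip(row, x)) for row in weight]
hxq_y = compressed_matvec(codebook, indices, sidecar, x)
rel_err = (sum((a - b) ** 2 for a, b in zip(dense_y, hxq_y))
           / sum(a * a for a in dense_y)) ** 0.5
print(f"relative output error: {rel_err:.4f}")
```

Note that nothing here looks at input activations: the codebook is fit from the weight values alone, which is the sense in which no calibration data is needed.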

## How It Works

1. `import helix_substrate` registers the `hxq` quantizer with HuggingFace
2. `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
3. The quantizer replaces 3,152 `nn.Linear` modules with `HelixLinear` shells before weight loading
4. Safetensors populates the codebook, indices, and sidecar buffers directly
5. The model runs in compressed form — no decompression needed
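
Steps 1 and 2 amount to a registry lookup keyed on `quant_method`. The sketch below mimics that dispatch with the standard library only. The registry, the decorator, and the trimmed config are hypothetical illustrations, not the transformers or `helix-substrate` API; only the `quantization_config.quant_method = "hxq"` field comes from this card.

```python
import json

# ASSUMPTION: a trimmed, illustrative config.json. Only `quant_method`
# is taken from this card; the other key is for illustration.
EXAMPLE_CONFIG = """
{
  "model_type": "olmoe",
  "quantization_config": {"quant_method": "hxq"}
}
"""

# Hypothetical registry standing in for the one a quantizer package
# fills in when it is imported.
QUANTIZERS = {}


def register(name):
    """Decorator that records a handler under a quant_method name."""
    def wrap(handler):
        QUANTIZERS[name] = handler
        return handler
    return wrap


@register("hxq")
def hxq_handler(config):
    return "replace nn.Linear modules with compressed shells, then load buffers"


def quant_method(config_text):
    """Read quantization_config.quant_method, or None if absent."""
    config = json.loads(config_text)
    return config.get("quantization_config", {}).get("quant_method")


def dispatch(config_text):
    """Pick the registered handler for this checkpoint, if any."""
    method = quant_method(config_text)
    if method is None:
        return None
    if method not in QUANTIZERS:
        raise KeyError(f"no quantizer registered for {method!r}")
    return QUANTIZERS[method](json.loads(config_text))


print(quant_method(EXAMPLE_CONFIG), "->", dispatch(EXAMPLE_CONFIG))
```

In this toy version, loading without the `hxq` handler registered raises a `KeyError`, which mirrors why `import helix_substrate` has to come before `from_pretrained()`.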

## Architecture Details

OLMoE-1B-7B-Instruct is a Mixture-of-Experts architecture with:
- **16 transformer layers**, each with attention + MoE MLP
- **64 experts per layer**, top-8 routing (1B active / 6.9B total parameters)
- **hidden_size=2048**, intermediate_size=1024 per expert
- **16 attention heads**, no GQA (num_kv_heads=16)

All 3,152 linear layers are compressed:
- **3,072 expert projections** (64 experts x 3 projections x 16 layers)
- **64 attention projections** (Q/K/V/O across 16 layers)
- **16 router gates** (expert routing per layer)

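Normalization layers (65, counting the per-layer Q/K norms), embeddings (1), and lm_head (1) are stored at full precision, which accounts for the 67 exact tensors in the compression receipt.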
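
The module count can be reproduced from the numbers above. The sketch also gives a rough storage estimate at one uint8 index per weight plus a 256-entry float32 codebook per module. It assumes each Q/K/V/O projection is `hidden_size x hidden_size` (16 heads, no GQA) and each router gate is `hidden_size x 64`. Sidecars, embeddings, lm_head, and norms are left out, so the byte figure is a lower bound and not a reproduction of the 6.7 GB checkpoint size.

```python
NUM_LAYERS = 16
NUM_EXPERTS = 64
HIDDEN_SIZE = 2048
EXPERT_INTERMEDIATE = 1024
PROJECTIONS_PER_EXPERT = 3
ATTENTION_PROJECTIONS = 4  # Q, K, V, O

# Module counts, as listed above.
expert_modules = NUM_EXPERTS * PROJECTIONS_PER_EXPERT * NUM_LAYERS
attention_modules = ATTENTION_PROJECTIONS * NUM_LAYERS
router_modules = NUM_LAYERS
total_modules = expert_modules + attention_modules + router_modules

# Weight counts. ASSUMPTION: square attention projections and a
# hidden_size x NUM_EXPERTS router gate.
expert_weights = expert_modules * HIDDEN_SIZE * EXPERT_INTERMEDIATE
attention_weights = attention_modules * HIDDEN_SIZE * HIDDEN_SIZE
router_weights = router_modules * HIDDEN_SIZE * NUM_EXPERTS
total_weights = expert_weights + attention_weights + router_weights

# Storage: one byte per index, 256 float32 codebook entries per module.
index_bytes = total_weights
codebook_bytes = total_modules * 256 * 4

print(f"modules: {total_modules}")
print(f"compressed weights: {total_weights / 1e9:.2f} B")
print(f"indices + codebooks: {(index_bytes + codebook_bytes) / 2**30:.2f} GiB")
```

Almost all of the weights sit in the expert projections, so the codebook overhead per module is negligible next to the index matrices.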

## Why This Matters

OLMoE is the first **Mixture-of-Experts** model compressed with HXQ. Combined with existing Transformer, SSM, and Hybrid results, this demonstrates that the same codec — same codebook size, same algorithm, same `pip install` — works across four distinct architecture families without modification.

## Companion Models

Same codec, same `pip install`, multiple architectures:

| Model | Architecture | Ratio | Eval Delta |
|-------|-------------|-------|------------|
| **olmoe-1b-7b-instruct-helix** | **MoE (64 experts)** | **1.9x** | **-0.16% HellaSwag** |
| [zamba2-2.7b-instruct-helix](https://huggingface.co/EchoLabs33/zamba2-2.7b-instruct-helix) | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% PPL |
| [zamba2-1.2b-helix](https://huggingface.co/EchoLabs33/zamba2-1.2b-helix) | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% PPL |
| [qwen2.5-14b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-14b-instruct-helix) | Transformer | 3.4x | pending |
| [qwen2.5-7b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-7b-instruct-helix) | Transformer | 2.2x | +6.34% PPL |
| [qwen2.5-3b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-3b-instruct-helix) | Transformer | 1.6x | +0.69% PPL |
| [qwen2.5-coder-3b-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-3b-helix) | Transformer (code) | 1.6x | +1.92% PPL |
| [qwen2.5-coder-1.5b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-1.5b-instruct-helix) | Transformer (code) | 2.4x | +1.63% PPL |
| [tinyllama-1.1b-helix](https://huggingface.co/EchoLabs33/tinyllama-1.1b-helix) | Transformer | 4.0x | +0.78% PPL |
| [mamba2-1.3b-helix](https://huggingface.co/EchoLabs33/mamba2-1.3b-helix) | Pure SSM (Mamba2) | 2.1x | +8.0% PPL |
| [mamba-130m-helix](https://huggingface.co/EchoLabs33/mamba-130m-helix) | Pure SSM | 3.8x | +18.4% PPL |

## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```

## License

Apache 2.0 (inherited from [allenai/OLMoE-1B-7B-0924-Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct)).