---
language: en
license: apache-2.0
base_model: Zyphra/Zamba2-1.2B
tags:
  - zamba2
  - mamba
  - hybrid
  - compressed
  - hxq
  - helix-substrate
  - vector-quantization
  - helixcode
library_name: transformers
pipeline_tag: text-generation
model-index:
  - name: zamba2-1.2b-helix
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - type: acc_norm
            value: 0.7112
            name: Accuracy (norm)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ARC-Easy
          type: ai2_arc
          config: ARC-Easy
        metrics:
          - type: acc_norm
            value: 0.7445
            name: Accuracy (norm)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ARC-Challenge
          type: ai2_arc
          config: ARC-Challenge
        metrics:
          - type: acc_norm
            value: 0.4821
            name: Accuracy (norm)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: WikiText-2
          type: wikitext
          config: wikitext-2-raw-v1
          split: test
        metrics:
          - type: perplexity
            value: 5.617
            name: Perplexity
            verified: true
---

# Zamba2-1.2B-HXQ

> **1.7x smaller than BF16. HellaSwag 71.1%. Fits in 1.35 GB.**
>
> Zamba2-1.2B (hybrid Mamba2 + Transformer) compressed from 2.3 GB (BF16) to 1.35 GB, a 1.7x reduction. HellaSwag accuracy (71.1%) falls within the range Zyphra reports for the dense model, and WikiText-2 perplexity rises by 2.90%. No calibration data. No architecture-specific tuning. Just `pip install` and `from_pretrained()`.

## Install and Run

```bash
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-1.2b-helix")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That's it. `import helix_substrate` registers the quantizer. `from_pretrained()` handles the rest automatically.

## Downstream Benchmarks

Evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) on an NVIDIA RTX 4090:

| Benchmark | HXQ (1.7x) | Dense (Zyphra reported) |
|-----------|---|---|
| **HellaSwag** (acc_norm) | **71.12%** | ~69-72% |
| **ARC-Easy** (acc_norm) | **74.45%** | — |
| **ARC-Challenge** (acc_norm) | **48.21%** | — |

HellaSwag accuracy falls within the range Zyphra reports for the dense model; no dense baseline is listed here for the ARC tasks. These are measured downstream scores, not perplexity proxies.

## Compression Benchmark

| | Dense (BF16) | HXQ |
|---|---|---|
| **Size** | 2.3 GB | **1.35 GB** |
| **Perplexity** (WikiText-2) | 5.458 | 5.617 **(+2.90%)** |
| **Compression ratio** | — | **1.7x** |
| **Compressed modules** | — | 136 HelixLinear layers |
| **Architecture** | Zamba2 (Mamba2 + shared Transformer) | unchanged |

Eval: WikiText-2 test split, 2048 tokens, stride 512.
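The stride setting matters when reproducing the perplexity figures. Below is a minimal sketch of strided evaluation. It assumes that "2048 tokens" is the context window length and that each window scores only the tokens the previous window did not cover, which is a common recipe for fixed-context models. The names `sliding_windows` and `perplexity` are illustrative and are not part of `helix-substrate`; the evaluation script actually used may differ.

```python
import math

def sliding_windows(n_tokens, max_len=2048, stride=512):
    """Yield (begin, end, n_scored) for strided perplexity evaluation.

    Assumption: every window spans up to max_len tokens, and only the
    tokens past the previous window's end contribute to the loss.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

def perplexity(total_nll, n_scored):
    """Perplexity is exp of the mean negative log-likelihood per scored token."""
    return math.exp(total_nll / n_scored)

windows = list(sliding_windows(5000))
print(windows[0])  # the first window scores all of its tokens
print(windows[1])  # later windows score only the newly exposed tokens
```

With this scheme every token is scored exactly once, and all but the first tokens of the text are predicted with at least `max_len - stride` tokens of preceding context.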

## Good to Know

- **GPU and CPU supported**: runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- **Fine-tunable via LoRA**: compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- **Requires `helix-substrate`**: the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- **+2.90% PPL delta**: WikiText-2 perplexity rises from 5.458 to 5.617. Whether this matters depends on your use case.
- **`mamba-ssm` recommended**: without it, the model falls back to a slower sequential code path.
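
One quick way to see which path to expect is to check whether the optional kernels are importable. This sketch uses only the standard library. It assumes the `mamba-ssm` package is imported under the name `mamba_ssm`; the helper name `has_mamba_ssm` is made up for illustration.

```python
import importlib.util

def has_mamba_ssm():
    """Return True if the optional mamba_ssm package can be found."""
    return importlib.util.find_spec("mamba_ssm") is not None

if has_mamba_ssm():
    print("mamba-ssm found")
else:
    print("mamba-ssm not found: expect the slower sequential fallback")
```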

## What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a **256-entry codebook** (float32) + **uint8 index matrix** + optional **sidecar corrections** for outlier values
- The compressed form *is* the executable — `HelixLinear` performs `codebook[indices] @ x` directly, no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- **No calibration data required** — unlike GPTQ/AWQ, codebooks are fit from the weights alone

## How It Works

1. `import helix_substrate` registers the `hxq` quantizer with HuggingFace
2. `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
3. The quantizer replaces 136 `nn.Linear` modules with `HelixLinear` shells before weight loading
4. Safetensors populates the codebook, indices, and sidecar buffers directly
5. The model runs in compressed form — no decompression needed
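
Steps 1 and 2 amount to a name-based lookup, which is why the bare `import helix_substrate` is needed even though nothing from it is called directly. The sketch below shows that pattern in plain Python. The registry, the decorator, and the class are hypothetical stand-ins, not the transformers or `helix-substrate` API. Only the `quantization_config.quant_method = "hxq"` field comes from the steps above.

```python
import json

# Hypothetical stand-in for a quantizer registry; not the transformers API.
QUANTIZERS = {}

def register_quantizer(name):
    """Class decorator that files a quantizer class under `name`."""
    def decorator(cls):
        QUANTIZERS[name] = cls
        return cls
    return decorator

# Registration happens as a side effect of defining the class, mirroring
# how importing a plugin package can register it.
@register_quantizer("hxq")
class SketchHxqQuantizer:
    def __init__(self, quant_config):
        self.quant_config = quant_config

def select_quantizer(config_text):
    """Pick a quantizer from the `quantization_config` block of a config.json."""
    quant_config = json.loads(config_text).get("quantization_config")
    if quant_config is None:
        return None  # plain checkpoint, no quantizer involved
    method = quant_config["quant_method"]
    if method not in QUANTIZERS:
        raise ValueError(f"no quantizer registered for {method!r}")
    return QUANTIZERS[method](quant_config)

config_text = '{"quantization_config": {"quant_method": "hxq"}}'
quantizer = select_quantizer(config_text)
print(type(quantizer).__name__)
```

If the registering import is skipped, the lookup has nothing filed under `"hxq"` and loading fails before any weights are read.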

## Architecture Details

Zamba2-1.2B is a hybrid architecture with:
- **32 Mamba2 layers** (SSM blocks with in_proj + out_proj linear layers)
- **6 hybrid layers** (Mamba2 + shared Transformer decoder with attention + MLP)
- **1 shared Transformer block** (reused at layers 5, 11, 17, 23, 29, 35)
- **38 total layers**, hidden_size=2048

All 136 linear layers (Mamba projections, attention Q/K/V/O, MLP gate/up/down, adapter layers) are compressed. Normalization layers, embeddings, and Mamba-specific parameters (A_log, D, dt_bias, conv1d) are stored at full precision.
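
The layer counts above can be checked with a few lines of Python. This sketch assumes the listed layer numbers are 0-based indices into the 38 layers; the names are illustrative only.

```python
N_LAYERS = 38
SHARED_BLOCK_LAYERS = [5, 11, 17, 23, 29, 35]  # from the list above, read as 0-based

def layer_kinds(n_layers=N_LAYERS, hybrid_at=SHARED_BLOCK_LAYERS):
    """Label each layer index as 'mamba2' or 'hybrid'."""
    hybrid = set(hybrid_at)
    return ["hybrid" if i in hybrid else "mamba2" for i in range(n_layers)]

kinds = layer_kinds()
print(kinds.count("mamba2"), kinds.count("hybrid"))

# The hybrid positions follow a fixed period: every sixth layer starting at 5.
print(SHARED_BLOCK_LAYERS == list(range(5, N_LAYERS, 6)))
```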

## Compression Receipt

```
Compressed tensors:  136
Exact tensors:       156  (norms, embeddings, biases)
From original model: 114  (A_log, D, dt_bias, conv1d)
Total keys:          814
Output size:         1,350 MB
Compression ratio:   1.7x
PPL delta:           +2.90% (5.617 vs 5.458 dense)
Eval: WikiText-2 test, 2048 tokens, stride=512
```
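
The receipt figures can be cross-checked with simple arithmetic. In the sketch below, the sizes and perplexities are the rounded values quoted in this card, so the recomputed delta agrees with the receipt only to about one decimal place. The breakdown of four stored keys per compressed tensor is an inference that makes the key total add up; the receipt does not state it.

```python
dense_gb, hxq_gb = 2.3, 1.35
ppl_dense, ppl_hxq = 5.458, 5.617

ratio = dense_gb / hxq_gb
ppl_delta_pct = (ppl_hxq - ppl_dense) / ppl_dense * 100

# Assumption: four stored keys per compressed tensor. This is inferred from
# the totals only; the receipt does not say how the keys break down.
KEYS_PER_COMPRESSED = 4
total_keys = 136 * KEYS_PER_COMPRESSED + 156 + 114

print(f"ratio: {ratio:.1f}x")
print(f"PPL delta: {ppl_delta_pct:+.1f}%")
print(f"total keys: {total_keys}")
```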

## Companion Models

Same codec, same `pip install`, multiple architectures:

| Model | Architecture | Ratio | PPL Delta |
|-------|-------------|-------|-----------|
| [qwen2.5-14b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-14b-instruct-helix) | Transformer | 3.4x | pending |
| [qwen2.5-7b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-7b-instruct-helix) | Transformer | 2.2x | +6.34% |
| [qwen2.5-3b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-3b-instruct-helix) | Transformer | 1.6x | +0.69% |
| [qwen2.5-coder-3b-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-3b-helix) | Transformer (code) | 1.6x | +1.92% |
| [qwen2.5-coder-1.5b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-1.5b-instruct-helix) | Transformer (code) | 2.4x | +1.63% |
| [tinyllama-1.1b-helix](https://huggingface.co/EchoLabs33/tinyllama-1.1b-helix) | Transformer | 4.0x | +0.78% |
| [zamba2-2.7b-instruct-helix](https://huggingface.co/EchoLabs33/zamba2-2.7b-instruct-helix) | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| [mamba2-1.3b-helix](https://huggingface.co/EchoLabs33/mamba2-1.3b-helix) | Pure SSM (Mamba2) | 2.1x | +8.0% |
| [mamba-130m-helix](https://huggingface.co/EchoLabs33/mamba-130m-helix) | Pure SSM | 3.8x | +18.4% |

## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```

## License

Apache 2.0 (inherited from [Zyphra/Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B)).