---
language: en
license: apache-2.0
base_model: Zyphra/Zamba2-1.2B
tags:
- zamba2
- mamba
- hybrid
- compressed
- hxq
- helix-substrate
- vector-quantization
- helixcode
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: zamba2-1.2b-helix
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: acc_norm
      value: 0.7112
      name: Accuracy (norm)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Easy
      type: ai2_arc
      config: ARC-Easy
    metrics:
    - type: acc_norm
      value: 0.7445
      name: Accuracy (norm)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Challenge
      type: ai2_arc
      config: ARC-Challenge
    metrics:
    - type: acc_norm
      value: 0.4821
      name: Accuracy (norm)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: WikiText-2
      type: wikitext
      config: wikitext-2-raw-v1
      split: test
    metrics:
    - type: perplexity
      value: 5.617
      name: Perplexity
      verified: true
---

# Zamba2-1.2B-HXQ

> **1.7x smaller from BF16. HellaSwag 71.1%. Fits in 1.35 GB.**
>
> Zamba2-1.2B (hybrid Mamba2 + Transformer) compressed from 2.3 GB (BF16) to 1.35 GB. Downstream task scores match the dense model after 1.7x compression. No calibration data. No architecture-specific tuning. Just `pip install` and `from_pretrained()`.

## Install and Run

```bash
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-1.2b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-1.2b-helix")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That's it. `import helix_substrate` registers the quantizer. `from_pretrained()` handles the rest automatically.

## Downstream Benchmarks

Evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) on an NVIDIA RTX 4090:

| Benchmark | HXQ (1.7x) | Dense (Zyphra reported) |
|-----------|---|---|
| **HellaSwag** (acc_norm) | **71.12%** | ~69-72% |
| **ARC-Easy** (acc_norm) | **74.45%** | — |
| **ARC-Challenge** (acc_norm) | **48.21%** | — |

Task performance is preserved after 1.7x compression. These are real downstream scores, not PPL proxies.

## Compression Benchmark

| | Dense (BF16) | HXQ |
|---|---|---|
| **Size** | 2.3 GB | **1.35 GB** |
| **Perplexity** (WikiText-2) | 5.458 | 5.617 **(+2.90%)** |
| **Compression ratio** | — | **1.7x** |
| **Compressed modules** | — | 136 HelixLinear layers |
| **Architecture** | Zamba2 (Mamba2 + shared Transformer) | unchanged |

Eval: WikiText-2 test split, 2048 tokens, stride 512.

## Good to Know

- **GPU and CPU supported**: runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- **Fine-tunable via LoRA**: compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- **Requires `helix-substrate`**: the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- **+2.90% PPL delta**: measurable but small. Whether this matters depends on your use case.
- **`mamba-ssm` recommended**: without it, the model falls back to a slower sequential code path.

## What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a **256-entry codebook** (float32) + **uint8 index matrix** + optional **sidecar corrections** for outlier values
- The compressed form *is* the executable: `HelixLinear` performs `codebook[indices] @ x` directly, with no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- **No calibration data required**: unlike GPTQ/AWQ, codebooks are fit from the weights alone

## How It Works

1. `import helix_substrate` registers the `hxq` quantizer with HuggingFace
2. `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
3. The quantizer replaces 136 `nn.Linear` modules with `HelixLinear` shells before weight loading
4. Safetensors populates the codebook, indices, and sidecar buffers directly
5. The model runs in compressed form, with no decompression needed

## Architecture Details

Zamba2-1.2B is a hybrid architecture with:

- **32 Mamba2 layers** (SSM blocks with in_proj + out_proj linear layers)
- **6 hybrid layers** (Mamba2 + shared Transformer decoder with attention + MLP)
- **1 shared Transformer block** (reused at layers 5, 11, 17, 23, 29, 35)
- **38 total layers**, hidden_size=2048

All 136 linear layers (Mamba projections, attention Q/K/V/O, MLP gate/up/down, adapter layers) are compressed. Normalization layers, embeddings, and Mamba-specific parameters (A_log, D, dt_bias, conv1d) are stored at full precision.

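To make the storage layout from "What is HelixCode?" concrete (a 256-entry float32 codebook, a uint8 index matrix, and an optional sidecar), here is a minimal NumPy sketch. It is not the `helix-substrate` implementation. The function names (`fit_codebook`, `assign`, `compress`, `reconstruct`), the 1-D k-means fitting, and the (position, value) sidecar layout are all assumptions made for illustration. Only the codebook size, the index dtype, and the `codebook[indices] @ x` forward pass are taken from this card.

```python
import numpy as np


def assign(values: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each value to the index of its nearest entry in a sorted codebook."""
    # Midpoints between neighbouring entries are the decision boundaries.
    boundaries = (codebook[:-1] + codebook[1:]) / 2.0
    # With 256 entries the result lies in [0, 255], so it fits in uint8.
    return np.searchsorted(boundaries, values).astype(np.uint8)


def fit_codebook(weights: np.ndarray, n_entries: int = 256, n_iters: int = 10) -> np.ndarray:
    """Fit a scalar codebook from the weights alone (hypothetical: 1-D k-means)."""
    flat = weights.ravel().astype(np.float32)
    # Start from evenly spaced quantiles so entries follow the weight distribution.
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, n_entries)).astype(np.float32)
    for _ in range(n_iters):
        idx = assign(flat, codebook).astype(np.intp)
        sums = np.bincount(idx, weights=flat, minlength=n_entries)
        counts = np.bincount(idx, minlength=n_entries)
        nonempty = counts > 0
        # Move each used entry to the mean of the weights assigned to it.
        codebook[nonempty] = (sums[nonempty] / counts[nonempty]).astype(np.float32)
        codebook.sort()
    return codebook


def compress(weights: np.ndarray, n_outliers: int = 0):
    """Return (codebook, uint8 indices, sidecar) for one weight matrix."""
    codebook = fit_codebook(weights)
    indices = assign(weights, codebook)  # same shape as weights
    # Assumed sidecar layout: exact values for the worst-approximated weights.
    err = np.abs(weights - codebook[indices]).ravel()
    positions = np.argsort(err)[::-1][:n_outliers].copy()
    values = weights.ravel()[positions]
    return codebook, indices, (positions, values)


def reconstruct(codebook: np.ndarray, indices: np.ndarray, sidecar) -> np.ndarray:
    """Table lookup plus sidecar corrections; yields the matrix used in the matmul."""
    dense = codebook[indices]
    positions, values = sidecar
    dense.reshape(-1)[positions] = values  # reshape(-1) is a view of the fresh array
    return dense


rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(512, 256)).astype(np.float32)  # stand-in weight matrix
x = rng.normal(size=(256,)).astype(np.float32)

codebook, indices, sidecar = compress(W, n_outliers=64)
W_hat = reconstruct(codebook, indices, sidecar)

# Forward pass in the spirit of `codebook[indices] @ x`.
rel_err = np.linalg.norm(W @ x - W_hat @ x) / np.linalg.norm(W @ x)

bf16_bytes = W.size * 2  # 2 bytes per weight in BF16
hxq_bytes = indices.nbytes + codebook.nbytes + sidecar[0].nbytes + sidecar[1].nbytes
print(f"relative output error: {rel_err:.4f}")
print(f"bytes: {bf16_bytes} (BF16) vs {hxq_bytes} (sketch)")
```

With one byte per index and a 1 KB codebook, each compressed matrix in this sketch takes roughly half the space of its BF16 original before sidecar overhead. Because embeddings, norms, and the Mamba-specific parameters stay at full precision, a whole-model ratio below 2x is what this layout would lead one to expect.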
## Compression Receipt

```
Compressed tensors: 136
Exact tensors: 156 (norms, embeddings, biases)
From original model: 114 (A_log, D, dt_bias, conv1d)
Total keys: 814
Output size: 1,350 MB
Compression ratio: 1.7x
PPL delta: +2.90% (5.617 vs 5.458 dense)
Eval: WikiText-2 test, 2048 tokens, stride=512
```

## Companion Models

Same codec, same `pip install`, multiple architectures:

| Model | Architecture | Ratio | PPL Delta |
|-------|-------------|-------|-----------|
| [qwen2.5-14b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-14b-instruct-helix) | Transformer | 3.4x | pending |
| [qwen2.5-7b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-7b-instruct-helix) | Transformer | 2.2x | +6.34% |
| [qwen2.5-3b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-3b-instruct-helix) | Transformer | 1.6x | +0.69% |
| [qwen2.5-coder-3b-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-3b-helix) | Transformer (code) | 1.6x | +1.92% |
| [qwen2.5-coder-1.5b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-1.5b-instruct-helix) | Transformer (code) | 2.4x | +1.63% |
| [tinyllama-1.1b-helix](https://huggingface.co/EchoLabs33/tinyllama-1.1b-helix) | Transformer | 4.0x | +0.78% |
| [zamba2-2.7b-instruct-helix](https://huggingface.co/EchoLabs33/zamba2-2.7b-instruct-helix) | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| [mamba2-1.3b-helix](https://huggingface.co/EchoLabs33/mamba2-1.3b-helix) | Pure SSM (Mamba2) | 2.1x | +8.0% |
| [mamba-130m-helix](https://huggingface.co/EchoLabs33/mamba-130m-helix) | Pure SSM | 3.8x | +18.4% |

## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```

## License

Apache 2.0 (inherited from [Zyphra/Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B)).