---
language: en
license: apache-2.0
base_model: state-spaces/mamba2-1.3b
tags:
- mamba2
- ssm
- state-space-model
- compressed
- hxq
- helix-substrate
- vector-quantization
- helixcode
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: mamba2-1.3b-helix
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: WikiText-2
      type: wikitext
      config: wikitext-2-raw-v1
      split: test
    metrics:
    - type: perplexity
      value: 10.00
      name: Perplexity
      verified: true
---

# Mamba2-1.3B-HXQ

> **2.1x smaller. +8.0% PPL. Pure Mamba2 SSM scales with compression.**
>
> Mamba2-1.3B compressed from 2.9 GB to 1.4 GB with +8.0% perplexity. Down from +18.4% at 130M: SSM compression quality improves with model size. No calibration data. Just `pip install` and `from_pretrained()`.

## Install and Run

```bash
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/mamba2-1.3b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/mamba2-1.3b-helix")

inputs = tokenizer("The future of artificial intelligence", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That's it. `import helix_substrate` registers the quantizer. `from_pretrained()` handles the rest automatically.

## Benchmark

| Metric | Dense (BF16) | HXQ |
|---|---|---|
| **Size** | 2.9 GB | **1.4 GB** |
| **Perplexity** (WikiText-2) | 9.26 | 10.00 **(+8.0%)** |
| **Compression ratio** | n/a | **2.1x** |
| **Compressed modules** | n/a | 98 (in_proj + out_proj + embedding) |
| **Architecture** | Mamba2 (48 layers, pure SSM) | unchanged |

## Good to Know

- **+8.0% PPL delta**: higher than transformers at this scale, but down from +18.4% at 130M. SSM compression quality scales with model size.
- **GPU and CPU supported**: runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- **Fine-tunable via LoRA**: compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- **Requires `helix-substrate`**: the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- **`mamba-ssm` recommended**: without it, the model falls back to a slower sequential code path.
- **Requires `transformers >= 4.45`**: needed for Mamba2 architecture support.

## What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a **256-entry codebook** (float32) + **uint8 index matrix** + optional **sidecar corrections** for outlier values
- The compressed form *is* the executable: `HelixLinear` performs `codebook[indices] @ x` directly, with no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- **No calibration data required**: unlike GPTQ/AWQ, codebooks are fit from the weights alone

## How It Works

1. `import helix_substrate` registers the `hxq` quantizer with HuggingFace
2. `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
3. The quantizer replaces 98 modules with `HelixLinear` shells before weight loading
4. Safetensors populates the codebook, indices, and sidecar buffers directly
5. The model runs in compressed form, with no decompression needed

## Architecture Details

Mamba2-1.3B is a pure state-space model with:

- **48 Mamba2 layers** (SSD blocks with in_proj + out_proj linear layers)
- **hidden_size=2048**, **64 heads**, **head_dim=64**
- **state_size=128**, **conv_kernel=4**

Only the in_proj, out_proj, and embedding layers are VQ-compressed. Mamba2-specific parameters (A_log, D, dt_bias, conv1d, norms) are stored at full precision.

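
The codebook-plus-index layout described under "What is HelixCode?" can be illustrated with a small NumPy sketch. This is not the `helix-substrate` implementation: the helper names `assign` and `fit_codebook`, the 1-D k-means fitting procedure, and the toy matrix size are all assumptions made for illustration, and the optional sidecar corrections for outliers are left out.

```python
import numpy as np


def assign(values: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return the index of the nearest entry in a sorted 1-D codebook."""
    # The nearest-entry boundary between two adjacent entries is their midpoint.
    midpoints = (codebook[:-1] + codebook[1:]) / 2.0
    return np.searchsorted(midpoints, values)


def fit_codebook(weights: np.ndarray, n_entries: int = 256, n_iters: int = 10) -> np.ndarray:
    """Fit a scalar codebook from the weights alone (hypothetical procedure)."""
    flat = weights.astype(np.float32).ravel()
    # Quantile initialisation spreads the entries across the weight distribution.
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, n_entries)).astype(np.float32)
    for _ in range(n_iters):
        idx = assign(flat, codebook)
        counts = np.bincount(idx, minlength=n_entries)
        sums = np.bincount(idx, weights=flat, minlength=n_entries)
        used = counts > 0
        # Lloyd update: move each used entry to the mean of its assigned weights.
        codebook[used] = sums[used] / counts[used]
        codebook.sort()
    return codebook


rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(64, 128)).astype(np.float32)  # toy weight matrix
x = rng.normal(size=128).astype(np.float32)                   # toy input vector

codebook = fit_codebook(W)                                    # 256 float32 values
indices = assign(W.ravel(), codebook).astype(np.uint8).reshape(W.shape)

y_dense = W @ x
y_vq = codebook[indices] @ x  # gather through the codebook, then matmul

rel_err = float(np.linalg.norm(W - codebook[indices]) / np.linalg.norm(W))
dense_bytes = W.size * 2                       # BF16 stores 2 bytes per weight
vq_bytes = indices.nbytes + codebook.nbytes    # 1 byte per weight + codebook
print(f"relative weight error: {rel_err:.4f}")
print(f"size ratio vs BF16: {dense_bytes / vq_bytes:.2f}x")
```

On a matrix this small the fixed codebook cost is a visible fraction of the compressed size; for the much larger projection matrices in a real model it becomes negligible, and the ratio against BF16 approaches 2x from the index matrix alone. Note that NumPy materialises `codebook[indices]` as a temporary dense array here, so the sketch shows the storage format and the arithmetic, not how `HelixLinear` executes it.
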
## Compression Receipt

```
Compressed modules: 98 (in_proj + out_proj + embedding)
Exact tensors: 337 (A_log, D, dt_bias, conv1d, norms)
Total keys: 736
Output size: 1,388 MB
Weight ratio: 2.1x
PPL delta: +8.0% (10.00 vs 9.26 dense)
Eval: WikiText-2 test, 2048 tokens, stride=512
```

## Companion Models

Same codec, same `pip install`, multiple architectures:

| Model | Architecture | Ratio | PPL Delta |
|-------|-------------|-------|-----------|
| [qwen2.5-14b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-14b-instruct-helix) | Transformer | 3.4x | pending |
| [qwen2.5-7b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-7b-instruct-helix) | Transformer | 2.2x | +6.34% |
| [qwen2.5-3b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-3b-instruct-helix) | Transformer | 1.6x | +0.69% |
| [qwen2.5-coder-3b-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-3b-helix) | Transformer (code) | 1.6x | +1.92% |
| [qwen2.5-coder-1.5b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-1.5b-instruct-helix) | Transformer (code) | 2.4x | +1.63% |
| [tinyllama-1.1b-helix](https://huggingface.co/EchoLabs33/tinyllama-1.1b-helix) | Transformer | 4.0x | +0.78% |
| [zamba2-2.7b-instruct-helix](https://huggingface.co/EchoLabs33/zamba2-2.7b-instruct-helix) | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| [zamba2-1.2b-helix](https://huggingface.co/EchoLabs33/zamba2-1.2b-helix) | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% |
| [mamba-130m-helix](https://huggingface.co/EchoLabs33/mamba-130m-helix) | Pure SSM | 3.8x | +18.4% |

## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```

## License

Apache 2.0 (inherited from [state-spaces/mamba2-1.3b](https://huggingface.co/state-spaces/mamba2-1.3b)).