---
language: en
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
tags:
- llama
- transformer
- compressed
- hxq
- helix-substrate
- vector-quantization
- helixcode
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: tinyllama-1.1b-helix
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: WikiText-2
      type: wikitext
      config: wikitext-2-raw-v1
      split: test
    metrics:
    - type: perplexity
      value: 6.220
      name: Perplexity
      verified: true
---

# TinyLlama-1.1B-HXQ

> **3.99x smaller. +0.78% perplexity. The fidelity reference.**
>
> TinyLlama-1.1B compressed from 4.4 GB (FP32) to 1.03 GB with the tightest PPL delta in the lineup. No calibration data. No architecture-specific tuning. Just `pip install` and `from_pretrained()`.

## Install and Run

```bash
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/tinyllama-1.1b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/tinyllama-1.1b-helix")

inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That's it. `import helix_substrate` registers the quantizer. `from_pretrained()` handles the rest automatically.

## Benchmark

| | Dense (FP32) | HXQ |
|---|---|---|
| **Size** | 4.4 GB | **1.03 GB** |
| **Perplexity** (WikiText-2) | 6.172 | 6.220 **(+0.78%)** |
| **Compression ratio** | n/a | **3.99x** |
| **Compressed modules** | n/a | 154 HelixLinear + 1 nn.Linear (lm_head) |
| **Architecture** | LLaMA (22 layers, GQA) | unchanged |

Eval: WikiText-2 test split, 2048 tokens, stride 512.

## Good to Know

- **GPU and CPU supported**: runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- **Fine-tunable via LoRA**: compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- **Requires `helix-substrate`**: the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.

## What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a **256-entry codebook** (float32) + **uint8 index matrix** + optional **sidecar corrections** for outlier values
- The compressed form *is* the executable: `HelixLinear` performs `codebook[indices] @ x` directly, with no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- **No calibration data required**: unlike GPTQ/AWQ, codebooks are fit from the weights alone

## How It Works

1. `import helix_substrate` registers the `hxq` quantizer with HuggingFace
2. `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
3. The quantizer replaces 154 `nn.Linear` modules with `HelixLinear` shells before weight loading
4. Safetensors populates the codebook, indices, and sidecar buffers directly
5. The model runs in compressed form, with no decompression needed

## Why TinyLlama?

This is the **fidelity benchmark**: at +0.78% PPL, it demonstrates that HelixCode compression introduces negligible degradation on a well-studied reference model. TinyLlama's weights are well-conditioned (low kurtosis), making it the ideal validation target.
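
The codebook-plus-index layout described under "What is HelixCode?" can be illustrated with a minimal NumPy sketch. It is not the `helix-substrate` implementation: the function names are hypothetical, the quantile-based codebook fit is a stand-in chosen for brevity, one uint8 index per weight is an assumption, and the optional sidecar corrections are left out.

```python
import numpy as np


def fit_codebook(weight: np.ndarray, n_entries: int = 256) -> np.ndarray:
    """Fit a sorted scalar codebook from the weights alone (no calibration data).

    Assumption: evenly spaced quantiles stand in for whatever fitting
    procedure the real codec uses.
    """
    qs = (np.arange(n_entries) + 0.5) / n_entries
    return np.quantile(weight.ravel(), qs).astype(np.float32)


def encode(weight: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each weight to the index of its nearest codebook entry."""
    # The codebook is sorted, so the midpoints between neighbouring entries
    # are the decision boundaries for nearest-entry assignment.
    midpoints = (codebook[:-1] + codebook[1:]) / 2
    return np.searchsorted(midpoints, weight).astype(np.uint8)


def codebook_forward(codebook: np.ndarray, indices: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Apply the layer as `codebook[indices] @ x`, gathering weights on the fly."""
    return codebook[indices] @ x


rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(256, 512)).astype(np.float32)  # toy weight matrix
x = rng.normal(size=512).astype(np.float32)

codebook = fit_codebook(W)
indices = encode(W, codebook)

# Storage: 4 bytes per weight dense, versus 1 byte per index plus a 1 KB codebook.
ratio = W.nbytes / (codebook.nbytes + indices.nbytes)

dense_out = W @ x
rel_err = np.linalg.norm(codebook_forward(codebook, indices, x) - dense_out) / np.linalg.norm(dense_out)

print(f"compression ratio: {ratio:.2f}x, relative output error: {rel_err:.4f}")
```

Under these assumptions the ratio sits just below 4x, because each float32 weight becomes a one-byte index and the 1 KB codebook is amortized over the whole matrix. The overhead shrinks as matrices grow, which is in line with the 3.99x weight ratio reported for this model, although the real codec also stores sidecar corrections.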

## Compression Receipt

```
Compressed tensors:   156
Exact tensors (npy):  45 (norms, embeddings)
From original model:  44
Total keys:           753
Output size:          1,053 MB
Weight ratio:         3.99x
PPL delta:            +0.78% (6.220 vs 6.172 dense)
Eval:                 WikiText-2 test, 2048 tokens, stride=512
```

## Companion Models

Same codec, same `pip install`, multiple architectures:

| Model | Architecture | Ratio | PPL Delta |
|-------|-------------|-------|-----------|
| [qwen2.5-14b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-14b-instruct-helix) | Transformer | 3.4x | pending |
| [qwen2.5-7b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-7b-instruct-helix) | Transformer | 2.2x | +6.34% |
| [qwen2.5-3b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-3b-instruct-helix) | Transformer | 1.6x | +0.69% |
| [qwen2.5-coder-3b-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-3b-helix) | Transformer (code) | 1.6x | +1.92% |
| [qwen2.5-coder-1.5b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-1.5b-instruct-helix) | Transformer (code) | 2.4x | +1.63% |
| [zamba2-2.7b-instruct-helix](https://huggingface.co/EchoLabs33/zamba2-2.7b-instruct-helix) | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| [zamba2-1.2b-helix](https://huggingface.co/EchoLabs33/zamba2-1.2b-helix) | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% |
| [mamba2-1.3b-helix](https://huggingface.co/EchoLabs33/mamba2-1.3b-helix) | Pure SSM (Mamba2) | 2.1x | +8.0% |
| [mamba-130m-helix](https://huggingface.co/EchoLabs33/mamba-130m-helix) | Pure SSM | 3.8x | +18.4% |

## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```

## License

Apache 2.0 (inherited from [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)).