---
language: en
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
tags:
  - llama
  - transformer
  - compressed
  - hxq
  - helix-substrate
  - vector-quantization
  - helixcode
library_name: transformers
pipeline_tag: text-generation
model-index:
  - name: tinyllama-1.1b-helix
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: WikiText-2
          type: wikitext
          config: wikitext-2-raw-v1
          split: test
        metrics:
          - type: perplexity
            value: 6.220
            name: Perplexity
            verified: false
---

# TinyLlama-1.1B-HXQ

> **3.99x smaller. +0.78% perplexity. The fidelity reference.**
>
> TinyLlama-1.1B compressed from 4.4 GB (FP32) to 1.03 GB with one of the tightest PPL deltas in the lineup (+0.78%). No calibration data. No architecture-specific tuning. Just `pip install` and `from_pretrained()`.

## Install and Run

```bash
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/tinyllama-1.1b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/tinyllama-1.1b-helix")

inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That's it. `import helix_substrate` registers the quantizer. `from_pretrained()` handles the rest automatically.

## Benchmark

| | Dense (FP32) | HXQ |
|---|---|---|
| **Size** | 4.4 GB | **1.03 GB** |
| **Perplexity** (WikiText-2) | 6.172 | 6.220 **(+0.78%)** |
| **Compression ratio** (weights) | — | **3.99x** |
| **Compressed modules** | — | 154 HelixLinear + 1 nn.Linear (lm_head) |
| **Architecture** | LLaMA (22 layers, GQA) | unchanged |

Eval: WikiText-2 test split, 2048 tokens, stride 512.
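
The card does not include the evaluation script. As a rough sketch, assuming "2048 tokens, stride 512" refers to the common sliding-window perplexity recipe (a 2048-token context advanced 512 tokens at a time, with only the new tokens scored in each window), the windows could be enumerated like this. The function name `sliding_windows` is hypothetical and not part of `helix-substrate`.

```python
def sliding_windows(seq_len, max_length=2048, stride=512):
    """Yield (begin, end, target_len) for strided perplexity evaluation.

    Assumption: each window holds at most `max_length` tokens and only the
    last `target_len` tokens of a window contribute to the loss, so every
    token is scored exactly once.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        target_len = end - prev_end  # tokens not yet scored by earlier windows
        yield begin, end, target_len
        prev_end = end
        if end == seq_len:
            break


# Example: window layout for a hypothetical 5000-token test stream.
windows = list(sliding_windows(5000))
```

In a full evaluation, each window would be fed to the model with the labels of the first `end - begin - target_len` positions masked out, and the per-token losses averaged before exponentiating.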

## Good to Know

- **GPU and CPU supported** — runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- **Fine-tunable via LoRA** — compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- **Requires `helix-substrate`** — the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.

## What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a **256-entry codebook** (float32) + **uint8 index matrix** + optional **sidecar corrections** for outlier values
- The compressed form *is* the executable — `HelixLinear` performs `codebook[indices] @ x` directly, no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- **No calibration data required** — unlike GPTQ/AWQ, codebooks are fit from the weights alone
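
To make the storage layout concrete, here is a minimal NumPy sketch of a 256-entry codebook with a uint8 index matrix, fit from the weights alone. It is an illustration, not the `helix-substrate` implementation: it assumes one index per weight, uses a plain 1-D k-means (Lloyd) fit with quantile initialisation, and omits sidecar corrections. The names `fit_codebook` and `codebook_forward` are hypothetical.

```python
import numpy as np


def fit_codebook(weights, n_entries=256, n_iters=10):
    """Fit a scalar codebook to a weight matrix (assumed method: 1-D k-means)."""
    flat = weights.astype(np.float32).ravel()
    # Start from evenly spaced quantiles so dense regions get more entries.
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, n_entries)).astype(np.float32)
    for _ in range(n_iters):
        codebook.sort()
        # In 1-D, the nearest centroid is found via midpoints between neighbours.
        midpoints = (codebook[:-1] + codebook[1:]) / 2
        idx = np.searchsorted(midpoints, flat)
        sums = np.bincount(idx, weights=flat, minlength=n_entries)
        counts = np.bincount(idx, minlength=n_entries)
        nonempty = counts > 0
        # Move each used centroid to the mean of its assigned weights.
        codebook[nonempty] = (sums[nonempty] / counts[nonempty]).astype(np.float32)
    codebook.sort()
    midpoints = (codebook[:-1] + codebook[1:]) / 2
    indices = np.searchsorted(midpoints, flat).astype(np.uint8)
    return codebook, indices.reshape(weights.shape)


def codebook_forward(codebook, indices, x):
    """Compute codebook[indices] @ x, gathering weights at matmul time."""
    return codebook[indices] @ x


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32)).astype(np.float32)  # stand-in weight matrix
codebook, indices = fit_codebook(W)
y = codebook_forward(codebook, indices, rng.normal(size=32).astype(np.float32))
```

Only `codebook` (256 float32 values) and `indices` (one byte per weight) need to be stored, which is where the roughly 4x reduction from FP32 comes from.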

## How It Works

1. `import helix_substrate` registers the `hxq` quantizer with HuggingFace
2. `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
3. The quantizer replaces 154 `nn.Linear` modules with `HelixLinear` shells before weight loading
4. Safetensors populates the codebook, indices, and sidecar buffers directly
5. The model runs in compressed form — no decompression needed
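
One way to check that the replacement in step 3 happened is to tally submodules by class name after loading. The helper below is a small sketch that relies only on the standard `modules()` iterator of a PyTorch module; `count_modules_by_class` is a hypothetical name, and the class name `HelixLinear` is taken from this card.

```python
from collections import Counter


def count_modules_by_class(model):
    """Tally submodules by class name for any object exposing .modules()."""
    return Counter(type(module).__name__ for module in model.modules())


# Usage with the model loaded in "Install and Run" (not executed here):
#   counts = count_modules_by_class(model)
#   counts["HelixLinear"]  # this card reports 154 such modules
#   counts["Linear"]       # this card reports 1 (lm_head)
```

Matching on the class name avoids importing anything from `helix_substrate` beyond what `from_pretrained()` already requires.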

## Why TinyLlama?

This is the **fidelity benchmark**: at +0.78% PPL, it shows that HelixCode compression costs less than 1% perplexity on a well-studied reference model. TinyLlama's weights are well-conditioned (low kurtosis), which makes it a natural validation target.

## Compression Receipt

```
Compressed tensors:  156
Exact tensors (npy): 45   (norms, embeddings)
From original model: 44
Total keys:          753
Output size:         1,053 MB
Weight ratio:        3.99x
PPL delta:           +0.78% (6.220 vs 6.172 dense)
Eval: WikiText-2 test, 2048 tokens, stride=512
```
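
The receipt figures can be sanity-checked with a little arithmetic. The sketch below assumes one uint8 index per weight plus a single 256-entry float32 codebook per matrix, and treats sidecar size as an unknown parameter that defaults to zero. Both function names are hypothetical, and the 2048 x 2048 shape is just an example matching TinyLlama's hidden size.

```python
def compressed_ratio(out_features, in_features, codebook_entries=256, sidecar_bytes=0):
    """FP32 bytes divided by (uint8 indices + float32 codebook + sidecar) bytes."""
    n_weights = out_features * in_features
    dense_bytes = 4 * n_weights  # float32 is 4 bytes per weight
    compressed_bytes = n_weights + 4 * codebook_entries + sidecar_bytes
    return dense_bytes / compressed_bytes


def ppl_delta_percent(dense_ppl, compressed_ppl):
    """Relative perplexity increase, in percent."""
    return 100.0 * (compressed_ppl - dense_ppl) / dense_ppl


ratio = compressed_ratio(2048, 2048)     # example projection-sized matrix
delta = ppl_delta_percent(6.172, 6.220)  # perplexities reported in this card
```

Under these assumptions the per-matrix ratio sits just below 4x, since the codebook adds a small fixed overhead. Any sidecar corrections and the tensors kept exact (norms, embeddings) would pull the overall figure down further.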

## Companion Models

Same codec, same `pip install`, multiple architectures:

| Model | Architecture | Ratio | PPL Delta |
|-------|-------------|-------|-----------|
| [qwen2.5-14b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-14b-instruct-helix) | Transformer | 3.4x | pending |
| [qwen2.5-7b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-7b-instruct-helix) | Transformer | 2.2x | +6.34% |
| [qwen2.5-3b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-3b-instruct-helix) | Transformer | 1.6x | +0.69% |
| [qwen2.5-coder-3b-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-3b-helix) | Transformer (code) | 1.6x | +1.92% |
| [qwen2.5-coder-1.5b-instruct-helix](https://huggingface.co/EchoLabs33/qwen2.5-coder-1.5b-instruct-helix) | Transformer (code) | 2.4x | +1.63% |
| [zamba2-2.7b-instruct-helix](https://huggingface.co/EchoLabs33/zamba2-2.7b-instruct-helix) | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| [zamba2-1.2b-helix](https://huggingface.co/EchoLabs33/zamba2-1.2b-helix) | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% |
| [mamba2-1.3b-helix](https://huggingface.co/EchoLabs33/mamba2-1.3b-helix) | Pure SSM (Mamba2) | 2.1x | +8.0% |
| [mamba-130m-helix](https://huggingface.co/EchoLabs33/mamba-130m-helix) | Pure SSM | 3.8x | +18.4% |

## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```

## License

Apache 2.0 (inherited from [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)).