--- license: mit language: - en tags: - bitnet - 1.58-bit - ternary - quantization - code - python - resurrection - manifold-learning pipeline_tag: text-generation --- # Distillix 125M: The Resurrected BitNet **"We didn't train it. We healed it."** This model is a scientific anomaly. It is a **1.58-bit (ternary)** LLM that suffered total weight collapse during training (weights → 0.00). Instead of retraining from scratch, we resurrected it using **Geometric Engineering** based on the *Latent Metric Model (LMM)* theory. ## The Crisis During BitNet training, standard weight decay (L2 regularization) created a catastrophic failure: - **The Zero Trap:** Weights were pushed toward zero by the optimizer - **Ternary Quantization:** Once weights fell below the quantization threshold, they snapped to 0 - **Sticky Death:** Gradients couldn't escape the zero bucket—the neurons died permanently - **Result:** 99.1% of MLP weights were zero. The model was brain-dead. ## The Resurrection Instead of discarding months of training, we applied **manifold physics** to bring the model back: | Phase | Method | Result | |-------|--------|--------| | **1. Time Travel** | Found `model_500steps.pt` (last checkpoint before total collapse) | MLP Std: 0.008 (dying but alive) | | **2. Geometric Engineering** | Wasserstein loss + SVD denoising | Task loss: 1.15 (learned!) but 88% sparse | | **3. Inflation** | Pushed weights FROM zero TO ±0.02 | Sparsity: 88% → 21.5% | | **4. Diagnosis** | Discovered chat-format training data | Model speaks when prompted correctly! | ### The Physics 1. **Polarized Optimizer:** Replaced weight decay with a "double-well potential" that REPELS weights from zero 2. **Three-Peaks Potential:** Enforced clustering at {-S, 0, +S} instead of just "away from zero" 3. **Wasserstein Loss (Syrota et al.):** Aligned the GLOBAL weight distribution to the BitNet lattice using optimal transport 4. **SVD Denoising (Whiteley et al.):** Projected weights onto principal components to remove noise while preserving structure 5. **Manifold Inflation:** Added redundancy back after over-compression to restore robustness ## Model Stats | Metric | Value | |--------|-------| | **Architecture** | Llama-style Transformer | | **Parameters** | 125M | | **Hidden Dim** | 768 | | **Layers** | 12 | | **Heads** | 12 (Query) / 4 (KV) | | **Quantization** | BitNet b1.58 (Ternary: {-1, 0, +1}) | | **Final Sparsity** | 21.5% | | **Weight Distribution** | 29% (-S) / 42% (0) / 29% (+S) | | **MLP Std** | 0.021 (exactly at target!) | | **Task Loss** | ~0.2 | ## Usage The model was resurrected on **chat-format data**, so it expects this prompt structure: ```python prompt = """### User: Write a Python function to calculate fibonacci numbers. ### Assistant: """ ``` ### Example Output ```python Here is a Python function to calculate Fibonacci numbers using recursion: def fibonacci(n): if n <= 0: return 0 elif n == 1: return 1 else: return fibonacci(n-1) + fibonacci(n-2) You can also use an iterative approach for better performance: def fibonacci_iter(n): a, b = 0, 1 for _ in range(n): a, b = b, a + b return a ``` ### Loading the Model ```python import torch from huggingface_hub import hf_hub_download # Download the resurrected model path = hf_hub_download( repo_id="rileyseaburg/distillix", filename="inflation/inflation-2000.pt" ) # Load checkpoint ckpt = torch.load(path, map_location='cpu') state_dict = ckpt['model_state_dict'] # Load into your model architecture model.load_state_dict(state_dict) ``` ## Files | File | Description | |------|-------------| | `model_500steps.pt` | The "Time Machine" - last healthy checkpoint before collapse | | `geometric/geometric-*.pt` | Checkpoints from Wasserstein+SVD training | | `inflation/inflation-2000.pt` | **THE FINAL MODEL** - fully resurrected | ## The Journey (TL;DR) ``` Step 500: MLP Std = 0.008 (Dying) Step 2000: MLP Std = 0.000 (Dead - 99% zeros) ↓ [GEOMETRIC ENGINEERING] ↓ Geometric: Task Loss = 1.15 (Learned! But 88% sparse) ↓ [INFLATION] ↓ Final: MLP Std = 0.021 (Alive!) Sparsity = 21.5% (Dense!) Distribution = 29/42/29 (Balanced!) OUTPUT: "Here is a Python function to calculate Fibonacci..." ``` ## Why This Matters 1. **Dead models can be resurrected** - You don't have to throw away collapsed checkpoints 2. **Manifold geometry is real** - The LMM theory predicted this would work, and it did 3. **BitNet needs special optimizers** - Standard weight decay is lethal for ternary networks 4. **Physics > Brute Force** - We healed the model with math, not more compute ## Theoretical Foundation - **Whiteley et al. (2025):** "Statistical Exploration of the Manifold Hypothesis" - Theorem 1 proves signal lives in principal components - **Syrota et al. (2025):** "Metric Identifiability in Latent Models" - Theorem 4.7 proves metric structure is recoverable from distribution ## Credits - **Engineering & Resurrection:** Riley Seaburg - **Theoretical Framework:** Whiteley, Gray, Rubin-Delanchy (LMM), Syrota et al. (Metric Identifiability) - **Original BitNet:** Microsoft Research ## License MIT --- *"The model was dead. We didn't retrain it. We performed surgery on its soul."*