| ---
|
| language:
|
| - en
|
| license: mit
|
| library_name: pytorch
|
| tags:
|
| - quantization
|
| - model-compression
|
| - bitnet
|
| - ternary-networks
|
| - deep-learning
|
| - pytorch
|
| - cuda
|
| - cpp
|
| - edge-ai
|
| - efficient-ml
|
| - low-precision
|
| - transformer
|
| pipeline_tag: other
|
| ---
|
|
|
| # BitLinear: Ultra-Low-Precision Linear Layers for PyTorch
|
|
|
|
|
|
| A production-ready PyTorch implementation of **1.58-bit ternary linear layers** that achieves **~19x memory compression** relative to fp32 `nn.Linear` weights while keeping output cosine similarity around 96%. It is a drop-in replacement for `nn.Linear` with optimized C++/CUDA kernels.
|
|
|
| ## Key Features
|
|
|
| - **19.3x Memory Compression** - Near-theoretical maximum (20x)
|
| - **Drop-in Replacement** - Same API as `nn.Linear`
|
| - **Optimized Kernels** - C++ CPU and CUDA GPU implementations
|
| - **Research-Grade** - Based on BitNet and JMLR ternary networks papers
|
| - **Production Ready** - Fully tested with comprehensive benchmarks
|
|
|
| ## Performance Highlights
|
|
|
| ### Memory Compression
|
|
|
| Achieves **19.23x average compression** across various layer sizes:
|
|
|
| | Layer Size | nn.Linear | BitLinear (Packed) | Compression |
|
| |------------|-----------|-------------------|-------------|
|
| | 512×512 | 1.00 MB | 0.05 MB | **18.6x** |
|
| | 1024×1024 | 4.00 MB | 0.21 MB | **19.3x** |
|
| | 4096×4096 | 64.02 MB | 3.23 MB | **19.8x** |
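
The ratios in the table can be reproduced with simple arithmetic. The sketch below assumes fp32 weights plus an fp32 bias for `nn.Linear`, and for the packed layer one byte per 5 ternary weights plus one fp32 scale and one fp32 bias per output channel. That accounting is an assumption that happens to match the reported numbers, not a description of the library's internals, and `layer_sizes` is a hypothetical helper.

```python
import math

def layer_sizes(in_features: int, out_features: int) -> tuple[int, int]:
    """Return (dense_bytes, packed_bytes) under the assumptions stated above."""
    n_weights = in_features * out_features
    # Dense layer: fp32 weights + fp32 bias.
    dense = 4 * n_weights + 4 * out_features
    # Packed layer: 5 ternary values per byte, rounded up to whole bytes.
    packed = math.ceil(n_weights / 5)
    # Assumed overhead: one fp32 scale and one fp32 bias per output channel.
    packed += 4 * out_features + 4 * out_features
    return dense, packed

for n in (512, 1024, 4096):
    dense, packed = layer_sizes(n, n)
    print(f"{n}x{n}: {dense / 2**20:.2f} MB -> {packed / 2**20:.2f} MB "
          f"({dense / packed:.1f}x)")
```

Under these assumptions the per-channel scale and bias are a fixed overhead, which is why the ratio rises toward 20x as the layer grows.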
|
|
|
| ### Real-World Example: GPT-2 Small
|
|
|
| Converting the linear layers of a GPT-2 Small model (12 layers, d_model=768, d_ff=3072); the figures cover linear-layer parameters only, not the embeddings:
|
|
|
| - **Original:** 324 MB
|
| - **BitLinear:** 16.8 MB
|
| - **Saved:** 307 MB (19.3x compression)
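
As a rough check of these figures, the linear layers of such a model can be counted directly. This is a sketch under assumptions: each block has four d_model×d_model attention projections and two feed-forward layers, dense storage is fp32 weights plus bias, and packed storage is one byte per 5 weights plus an fp32 scale and bias per output channel. Embeddings and layer norms are left out.

```python
import math

D_MODEL, D_FF, N_LAYERS = 768, 3072, 12

# (in_features, out_features) for the linear layers of one block:
# four attention projections, then the two feed-forward layers.
block = [(D_MODEL, D_MODEL)] * 4 + [(D_MODEL, D_FF), (D_FF, D_MODEL)]

dense = 0
packed = 0
for fan_in, fan_out in block * N_LAYERS:
    n_weights = fan_in * fan_out
    dense += 4 * n_weights + 4 * fan_out             # fp32 weights + fp32 bias
    packed += math.ceil(n_weights / 5) + 8 * fan_out  # packed trits + fp32 scale and bias

print(f"Original:  {dense / 2**20:.1f} MB")
print(f"BitLinear: {packed / 2**20:.1f} MB")
print(f"Ratio:     {dense / packed:.1f}x")
```

With these assumptions the totals come out at roughly 324 MB and 16.8 MB, in line with the numbers listed above.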
|
|
|
| ### Accuracy
|
|
|
| Maintains high output similarity despite extreme quantization:
|
|
|
| - **Cosine Similarity:** 96.3%
|
| - **Relative Error:** ~28%
|
| - **Multi-Ternary (k=3):** 75% error reduction vs k=1
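
For readers unfamiliar with the two metrics, the sketch below uses the usual definitions: cosine similarity between the reference and quantized outputs, and the L2 norm of their difference relative to the L2 norm of the reference. These definitions are an assumption; the benchmark scripts may compute them differently. The vectors are toy values, not measured outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def relative_error(reference, approx):
    """L2 norm of the difference, divided by the L2 norm of the reference."""
    diff = math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, approx)))
    return diff / math.sqrt(sum(r * r for r in reference))

reference = [1.0, 2.0, 3.0]   # stand-in for an nn.Linear output
approx = [1.0, 2.0, 2.0]      # stand-in for a quantized-layer output
print(f"cosine similarity: {cosine_similarity(reference, approx):.3f}")
print(f"relative error:    {relative_error(reference, approx):.3f}")
```

The toy vectors give a cosine similarity near 0.98 alongside a relative error near 27%, which illustrates how a high cosine similarity and a sizeable relative error can coexist.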
|
|
|
| See [BENCHMARKS.md](BENCHMARKS.md) for detailed performance analysis.
|
|
|
| ## Quick Start
|
|
|
| ### Installation
|
|
|
| ```bash
|
| # CPU-only build
|
| pip install -e .
|
|
|
| # With CUDA support (requires CUDA toolkit)
|
| CUDA_HOME=/usr/local/cuda pip install -e .
|
| ```
|
|
|
| ### Basic Usage
|
|
|
| ```python
|
| import torch
|
| from bitlinear import BitLinear
|
|
|
| # Create a BitLinear layer (same interface as nn.Linear)
|
| layer = BitLinear(in_features=512, out_features=1024, bias=True)
|
|
|
| # Forward pass
|
| x = torch.randn(32, 128, 512)
|
| output = layer(x)  # Same call signature as nn.Linear; output shape (32, 128, 1024)
|
|
|
| print(f"Weight values: {torch.unique(layer.W_ternary)}") # [-1, 0, 1]
|
| ```
|
|
|
| ### Converting Existing Models
|
|
|
| ```python
|
| import torch
|
| import torch.nn as nn
|
| from bitlinear import convert_linear_to_bitlinear
|
|
|
| # Convert a model (freshly initialized here; in practice, load pre-trained weights first)
|
| model = nn.TransformerEncoderLayer(d_model=512, nhead=8)
|
| model_compressed = convert_linear_to_bitlinear(model, inplace=False)
|
|
|
| # Use as normal - all Linear layers are now BitLinear
|
| x = torch.randn(10, 32, 512)
|
| output = model_compressed(x)
|
| ```
|
|
|
| ### Multi-Ternary for Better Accuracy
|
|
|
| ```python
|
| from bitlinear import MultiTernaryLinear
|
|
|
| # Use k=3 components for 75% error reduction
|
| layer = MultiTernaryLinear(in_features=512, out_features=1024, k=3)
|
| ```
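
The idea behind using several ternary components can be sketched as a greedy residual fit: quantize the weights, subtract the scaled approximation, then quantize what is left. The code below is a framework-free illustration on a single weight row, not the library's `greedy_ternary_decomposition()`; the rounding rule, the least-squares scale, and the names `ternarize`, `greedy_decompose_row`, and `residual_norm` are all assumptions made for this example.

```python
import math

def ternarize(row):
    """Round row / max|row| to the nearest value in {-1, 0, +1}."""
    peak = max(abs(v) for v in row)
    if peak == 0.0:
        return [0] * len(row)
    return [round(v / peak) for v in row]

def greedy_decompose_row(row, k):
    """Approximate row as a sum of up to k scaled ternary vectors."""
    residual = list(row)
    components = []
    for _ in range(k):
        t = ternarize(residual)
        nonzero = sum(abs(v) for v in t)
        if nonzero == 0:
            break  # nothing left to approximate
        # Least-squares scale for this component: <residual, t> / <t, t>.
        alpha = sum(r * s for r, s in zip(residual, t)) / nonzero
        components.append((alpha, t))
        residual = [r - alpha * s for r, s in zip(residual, t)]
    return components

def residual_norm(row, components):
    """L2 norm of row minus the sum of its scaled ternary components."""
    approx = [0.0] * len(row)
    for alpha, t in components:
        approx = [a + alpha * s for a, s in zip(approx, t)]
    return math.sqrt(sum((w - a) ** 2 for w, a in zip(row, approx)))

row = [0.9, -0.4, 0.1, -0.7]
for k in (1, 2, 3):
    print(k, residual_norm(row, greedy_decompose_row(row, k)))
```

Each extra component costs another packed ternary matrix and another scale, so in this scheme k trades memory for approximation quality.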
|
|
|
| ## How It Works
|
|
|
| BitLinear uses **ternary quantization** to represent weights with only three values: {-1, 0, +1}.
|
|
|
| ### Architecture
|
|
|
| 1. **Quantization:** Weights quantized to {-1, 0, +1} using absmax scaling
|
| 2. **Scaling:** Per-output-channel scaling factors (gamma) compensate for quantization
|
| 3. **Packing:** Base-3 encoding stores 5 ternary values per byte
|
| 4. **Computation:** Optimized kernels exploit ternary structure (no multiplications needed)
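
Steps 1, 2, and 4 can be sketched without any framework. This is one plausible reading of "absmax scaling", assumed for illustration: each output row is divided by its largest absolute value, rounded to {-1, 0, +1}, and that maximum is kept as the row's scale gamma. The function names are hypothetical and the actual kernels may use a different threshold or scale.

```python
def quantize_row(row):
    """Return (gamma, ternary) such that row is approximated by gamma * ternary."""
    gamma = max(abs(v) for v in row)
    if gamma == 0.0:
        return 0.0, [0] * len(row)
    return gamma, [round(v / gamma) for v in row]

def ternary_matvec(gammas, ternary_rows, x):
    """Compute y = diag(gammas) @ T @ x using only adds and subtracts inside the sum."""
    out = []
    for gamma, t_row in zip(gammas, ternary_rows):
        acc = 0.0
        for t, xi in zip(t_row, x):
            if t == 1:
                acc += xi       # +1 weight: add the input
            elif t == -1:
                acc -= xi       # -1 weight: subtract the input
            # 0 weight: skip the input entirely
        out.append(gamma * acc)  # one multiply per output channel
    return out

weights = [[0.8, -0.1, -0.9],
           [0.2, 0.5, -0.05]]
quantized = [quantize_row(row) for row in weights]
gammas = [g for g, _ in quantized]
ternary = [t for _, t in quantized]

x = [1.0, 2.0, 3.0]
print(ternary)                             # ternary weight rows
print(ternary_matvec(gammas, ternary, x))  # approximate W @ x
```

The inner loop contains no multiplications; the only multiply is the per-channel scale applied once per output.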
|
|
|
| ### Memory Efficiency
|
|
|
| - **Theoretical:** log₂(3) ≈ 1.58 bits per weight
|
| - **Actual:** 1.6 bits per weight (5 values per byte)
|
| - **Efficiency:** about 99% of the theoretical optimum (log₂(3) / 1.6 ≈ 0.991)
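
Five ternary values fit in one byte because 3^5 = 243 ≤ 256. A minimal sketch of such a base-3 encoding is shown below; the digit order (first value in the least-significant position) and the function names are assumptions for illustration and may not match the layout used in `packing.py`.

```python
import math

def pack_trits(trits):
    """Pack values in {-1, 0, +1} into bytes, 5 values per byte."""
    packed = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        # Map {-1, 0, +1} to base-3 digits {0, 1, 2}; first value is the lowest digit.
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)
        packed.append(byte)
    return bytes(packed)

def unpack_trits(packed, count):
    """Inverse of pack_trits; count drops the padding in the last byte."""
    trits = []
    for byte in packed:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]

values = [1, 0, -1, -1, 1, 0, 1]
packed = pack_trits(values)
assert unpack_trits(packed, len(values)) == values

print(f"{len(values)} values -> {len(packed)} bytes")
print(f"bits per weight: {8 / 5} (lower bound: {math.log2(3):.3f})")
```

The original length has to be stored separately, since the last byte is padded when the number of weights is not a multiple of 5.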
|
|
|
| ## Project Structure
|
|
|
| ```
|
| BitLinear/
|
| ├── bitlinear/                    # Main package
|
| │   ├── layers.py                 # BitLinear and MultiTernaryLinear modules
|
| │   ├── functional.py             # Core functional implementations
|
| │   ├── quantization.py           # Ternary quantization utilities
|
| │   ├── packing.py                # Base-3 packing for memory efficiency
|
| │   └── cpp/                      # C++/CUDA extensions
|
| │       ├── bitlinear.cpp         # PyBind11 bindings & CPU kernels
|
| │       └── bitlinear_kernel.cu   # CUDA GPU kernels
|
| ├── tests/                        # Comprehensive test suite
|
| ├── examples/                     # Usage examples
|
| │   ├── basic_usage.py            # Simple demonstrations
|
| │   └── transformer_example.py    # Transformer integration
|
| ├── benchmarks/                   # Performance benchmarks
|
| │   ├── benchmark_memory.py       # Memory analysis
|
| │   └── benchmark_performance.py  # Speed comparison
|
| └── notebooks/                    # Interactive tutorials
|
|     └── demo.md                   # Step-by-step guide
|
| ```
|
|
|
| ## Examples
|
|
|
| ### Example 1: Basic Layer
|
|
|
| ```python
|
| from bitlinear import BitLinear, estimate_memory_savings
|
|
|
| # Create layer
|
| layer = BitLinear(512, 1024)
|
|
|
| # Check memory savings
|
| stats = estimate_memory_savings(512, 1024)
|
| print(f"Compression: {stats['compression_ratio']:.1f}x") # ~19x
|
| ```
|
|
|
| ### Example 2: Transformer Conversion
|
|
|
| ```python
|
| import torch.nn as nn
|
| from bitlinear import convert_linear_to_bitlinear
|
|
|
| # Original transformer
|
| model = nn.TransformerEncoderLayer(d_model=768, nhead=8, dim_feedforward=3072)
|
|
|
| # Convert to BitLinear (inplace=False keeps the original model for comparison)
|
| model_bit = convert_linear_to_bitlinear(model, inplace=False)
|
|
|
| # Compare memory
|
| mem_original = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2
|
| mem_bitlinear = sum(p.numel() * p.element_size() for p in model_bit.parameters()) / 1024**2
|
| print(f"Memory: {mem_original:.2f} MB → {mem_bitlinear:.2f} MB")
|
| ```
|
|
|
| Run complete examples:
|
|
|
| ```bash
|
| python examples/basic_usage.py
|
| python examples/transformer_example.py
|
| ```
|
|
|
| ## Benchmarks
|
|
|
| Run benchmarks to see performance on your hardware:
|
|
|
| ```bash
|
| # Memory compression analysis
|
| python benchmarks/benchmark_memory.py
|
|
|
| # Forward pass performance
|
| python benchmarks/benchmark_performance.py
|
| ```
|
|
|
| ## Testing
|
|
|
| Comprehensive test suite with 60+ tests:
|
|
|
| ```bash
|
| # Run all tests
|
| pytest tests/ -v
|
|
|
| # Run specific test modules
|
| pytest tests/test_quantization.py -v
|
| pytest tests/test_layers.py -v
|
| ```
|
|
|
| ## Research Background
|
|
|
| This implementation is based on:
|
|
|
| - **BitNet:** [Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
|
| - **JMLR:** [Ternary Representations of Neural Networks](https://jmlr.org/papers/volume26/24-2050/24-2050.pdf)
|
|
|
| ### Key Innovations
|
|
|
| 1. **Ternary Quantization:** Reduces weights to {-1, 0, +1}
|
| 2. **Absmax Scaling:** Per-channel scaling for accuracy
|
| 3. **Greedy Decomposition:** Multi-ternary for better approximation
|
| 4. **Base-3 Packing:** Near-optimal memory compression
|
|
|
| ## Implementation Details
|
|
|
| ### Python Baseline
|
|
|
| Pure PyTorch implementation for correctness and clarity:
|
| - `bitlinear_python()` - Reference ternary matmul
|
| - `greedy_ternary_decomposition()` - Multi-component quantization
|
| - Full gradient support for training
|
|
|
| ### C++ Extensions
|
|
|
| Optimized CPU kernels with PyBind11:
|
| - Ternary-specific optimizations (no multiplications)
|
| - Efficient memory access patterns
|
| - Base-3 packing/unpacking
|
|
|
| ### CUDA Kernels
|
|
|
| GPU-accelerated implementation:
|
| - Warp-level reductions using shuffle intrinsics
|
| - Shared memory tiling
|
| - Memory coalescing
|
| - Fused multi-ternary kernels
|
|
|
| ## Use Cases
|
|
|
| ### Ideal For:
|
|
|
| - **Edge Deployment:** Mobile and embedded devices
|
| - **Large Models:** Billion-parameter models with memory constraints
|
| - **Production Inference:** Cost-effective serving at scale
|
| - **Research:** Exploring ultra-low-precision networks
|
|
|
| ### Considerations:
|
|
|
| - **Training:** Best results with quantization-aware training (QAT)
|
| - **Accuracy:** 3-5% accuracy drop typical (acceptable for many tasks)
|
| - **Speed:** The pure-Python path may be slower than `nn.Linear`; use the C++/CUDA kernels for production
|
|
|
| ## Documentation
|
|
|
| - **[BENCHMARKS.md](BENCHMARKS.md)** - Detailed performance analysis
|
| - **[MODEL_CARD.md](MODEL_CARD.md)** - HuggingFace model card
|
| - **[notebooks/demo.md](notebooks/demo.md)** - Interactive tutorial
|
| - **[read/IMPLEMENTATION_GUIDE.md](read/IMPLEMENTATION_GUIDE.md)** - Implementation details (not yet published; can be released on request while the pipeline is being extended to support future machine learning research)
|
|
|
| ## Contributing
|
|
|
| Contributions welcome! Areas for improvement:
|
|
|
| - AVX/AVX512 vectorization for CPU
|
| - Tensor Core utilization for CUDA
|
| - Additional quantization schemes
|
| - Training examples and tutorials
|
|
|
| ## License
|
|
|
| MIT License - see [LICENSE](LICENSE) file for details.
|
|
|
| ## Citation
|
|
|
| If you use BitLinear in your research, please cite:
|
|
|
| ```bibtex
|
| @article{jmlr_ternary_2024,
|
| title={Ternary Representations of Neural Networks},
|
| journal={Journal of Machine Learning Research},
|
| volume={26},
|
| year={2024},
|
| url={https://jmlr.org/papers/volume26/24-2050/24-2050.pdf}
|
| }
|
|
|
| @article{bitnet2023,
|
| title={BitNet: Scaling 1-bit Transformers for Large Language Models},
|
| author={Wang, Hongyu and Ma, Shuming and Dong, Li and Huang, Shaohan and Wang, Huaijie and Ma, Lingxiao and Yang, Fan and Wang, Ruiping and Wu, Yi and Wei, Furu},
|
| journal={arXiv preprint arXiv:2310.11453},
|
| year={2023}
|
| }
|
| ```
|
|
|
| ## Acknowledgments
|
|
|
| This implementation builds upon the groundbreaking work in:
|
| - BitNet by Microsoft Research
|
| - Ternary Neural Networks research (JMLR)
|
| - PyTorch's extensibility framework
|
|
|
| ## Contact
|
|
|
| For questions, issues, or collaboration:
|
| - Open an issue on GitHub
|
| - Check existing documentation
|
| - Review examples and benchmarks
|
|
|
| ---
|
|
|
| Please tag me if you use this in anything you build. I would love to see what you build with it.
|
|
|
| Made with ❤️ for efficient deep learning
|
|
|