README.md · krisaujla/BitLinear at main

	---
	language:
	- en
	license: mit
	library_name: pytorch
	tags:
	- quantization
	- model-compression
	- bitnet
	- ternary-networks
	- deep-learning
	- pytorch
	- cuda
	- cpp
	- edge-ai
	- efficient-ml
	- low-precision
	- transformer
	pipeline_tag: other
	---

	# BitLinear: Ultra-Low-Precision Linear Layers for PyTorch

	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
	[![PyTorch 2.0+](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)

	A production-ready PyTorch implementation of 1.58-bit ternary linear layers that achieves ~19x memory compression over float32 weights while keeping layer outputs close to full precision (96.3% cosine similarity in the benchmarks below). It is a drop-in replacement for `nn.Linear` with optimized C++/CUDA kernels.

	## Key Features

	- 19.3x Memory Compression - Close to the 20x limit of packing 5 ternary values per byte (1.6 bits per weight vs. 32-bit floats)
	- Drop-in Replacement - Same API as `nn.Linear`
	- Optimized Kernels - C++ CPU and CUDA GPU implementations
	- Research-Grade - Based on BitNet and JMLR ternary networks papers
	- Production Ready - Fully tested with comprehensive benchmarks

	## 📊 Performance Highlights

	### Memory Compression

	Achieves 19.23x average compression across various layer sizes:

	| Layer Size | nn.Linear | BitLinear (Packed) | Compression |
	|------------|-----------|--------------------|-------------|
	| 512×512 | 1.00 MB | 0.05 MB | 18.6x |
	| 1024×1024 | 4.00 MB | 0.21 MB | 19.3x |
	| 4096×4096 | 64.02 MB | 3.23 MB | 19.8x |
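
The figures in this table can be reproduced with a few lines of arithmetic. The sketch below rests on assumptions that are not stated above: both layers carry a float32 bias, the packed layer also keeps one float32 scale per output channel, and "MB" means MiB (2^20 bytes). `layer_bytes` is an illustrative helper, not part of the package.

```python
import math

MIB = 1024 ** 2

def layer_bytes(in_features, out_features):
    """Return (float32_bytes, packed_bytes) for one linear layer.

    Assumed accounting: float32 weights and bias for nn.Linear; for the
    packed layer, 5 ternary weights per byte plus a float32 scale and a
    float32 bias per output channel.
    """
    n_weights = in_features * out_features
    dense = 4 * (n_weights + out_features)
    packed = math.ceil(n_weights / 5) + 4 * out_features + 4 * out_features
    return dense, packed

rows = []
for size in (512, 1024, 4096):
    dense, packed = layer_bytes(size, size)
    rows.append((size, round(dense / MIB, 2), round(packed / MIB, 2),
                 round(dense / packed, 1)))
    print(f"{size}x{size}: {dense / MIB:.2f} MiB -> "
          f"{packed / MIB:.2f} MiB ({dense / packed:.1f}x)")
```

Under these assumptions the three rows come out as in the table. The ratio approaches 20x as the per-channel scale and bias become negligible next to the weights.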

	### Real-World Example: GPT-2 Small

	Converting a GPT-2 Small model (12 layers, d_model=768, d_ff=3072):

	- Original: 324 MB
	- BitLinear: 16.8 MB
	- Saved: 307 MB (19.3x compression)
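
These totals are consistent with counting only the four linear projections in each transformer block. The following is a back-of-the-envelope check under that assumption: embeddings, layer norms, and the LM head are excluded, sizes are in MiB, and the packed cost is taken as 5 weights per byte plus a float32 scale and bias per output channel. None of these accounting choices is confirmed above.

```python
import math

MIB = 1024 ** 2
D_MODEL, D_FF, N_LAYERS = 768, 3072, 12

# (in_features, out_features) of the linear layers in one block:
# fused QKV projection, attention output projection, and the two MLP layers.
block_linears = [
    (D_MODEL, 3 * D_MODEL),
    (D_MODEL, D_MODEL),
    (D_MODEL, D_FF),
    (D_FF, D_MODEL),
]

dense_bytes = 0
packed_bytes = 0
for n_in, n_out in block_linears:
    # float32 weights + float32 bias
    dense_bytes += 4 * (n_in * n_out + n_out)
    # base-3 packed weights + float32 scale and bias per output channel
    packed_bytes += math.ceil(n_in * n_out / 5) + 8 * n_out

dense_mib = N_LAYERS * dense_bytes / MIB
packed_mib = N_LAYERS * packed_bytes / MIB
ratio = dense_mib / packed_mib
print(f"{dense_mib:.0f} MiB -> {packed_mib:.1f} MiB ({ratio:.1f}x)")
```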

	### Accuracy

	Maintains high output similarity despite extreme quantization:

	- Cosine Similarity: 96.3%
	- Relative Error: ~28%
	- Multi-Ternary (k=3): 75% error reduction vs k=1
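
For reference, the two metrics can be computed as below. The exact definitions used by the benchmarks are not given here, so this sketch assumes the usual ones: cosine similarity between the full-precision and quantized outputs, and relative error as the L2 norm of their difference divided by the L2 norm of the full-precision output. The function names are illustrative.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(reference, approx):
    dot = sum(r * a for r, a in zip(reference, approx))
    return dot / (l2_norm(reference) * l2_norm(approx))

def relative_error(reference, approx):
    diff = [r - a for r, a in zip(reference, approx)]
    return l2_norm(diff) / l2_norm(reference)

# Toy vectors standing in for full-precision and quantized layer outputs.
reference = [3.0, 4.0]
approx = [3.0, 3.0]
cos = cosine_similarity(reference, approx)
rel = relative_error(reference, approx)

# For a given cosine similarity c, no rescaling of the approximation can
# push the relative error below sqrt(1 - c^2).
floor_at_reported_cos = math.sqrt(1 - 0.963 ** 2)
print(f"cos={cos:.3f} rel={rel:.3f} floor={floor_at_reported_cos:.3f}")
```

With these definitions, a cosine similarity of 96.3% implies a relative error of at least about 27%, so the two reported numbers are mutually consistent.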

	See [BENCHMARKS.md](BENCHMARKS.md) for detailed performance analysis.

	## 🚀 Quick Start

	### Installation

	```bash
	# CPU-only build
	pip install -e .

	# With CUDA support (requires CUDA toolkit)
	CUDA_HOME=/usr/local/cuda pip install -e .
	```

	### Basic Usage

	```python
	import torch
	from bitlinear import BitLinear

	# Create a BitLinear layer (same interface as nn.Linear)
	layer = BitLinear(in_features=512, out_features=1024, bias=True)

	# Forward pass
	x = torch.randn(32, 128, 512)
	output = layer(x) # Same as nn.Linear!

	print(f"Weight values: {torch.unique(layer.W_ternary)}") # [-1, 0, 1]
	```

	### Converting Existing Models

	```python
	import torch
	import torch.nn as nn
	from bitlinear import convert_linear_to_bitlinear

	# Convert a pre-trained model
	model = nn.TransformerEncoderLayer(d_model=512, nhead=8)
	model_compressed = convert_linear_to_bitlinear(model, inplace=False)

	# Use as normal - all Linear layers are now BitLinear
	x = torch.randn(10, 32, 512)
	output = model_compressed(x)
	```

	### Multi-Ternary for Better Accuracy

	```python
	from bitlinear import MultiTernaryLinear

	# Use k=3 components for 75% error reduction
	layer = MultiTernaryLinear(in_features=512, out_features=1024, k=3)
	```

	## 📖 How It Works

	BitLinear uses ternary quantization to represent weights with only three values: {-1, 0, +1}.

	### Architecture

	1. Quantization: Weights quantized to {-1, 0, +1} using absmax scaling
	2. Scaling: Per-output-channel scaling factors (gamma) compensate for quantization
	3. Packing: Base-3 encoding stores 5 ternary values per byte
	4. Computation: Optimized kernels exploit the ternary structure, so the matrix product itself needs only additions and subtractions; the per-channel scale is applied once per output
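
Steps 1, 2, and 4 can be sketched in plain Python. This is a minimal illustration, not the package's code: the rounding rule (a weight becomes ±1 when its magnitude exceeds half the channel's absmax) is an assumption, and `quantize_row` and `ternary_matvec` are hypothetical names. It is written without PyTorch so the add/subtract structure of the inner loop is explicit.

```python
def quantize_row(row):
    """Quantize one output channel to {-1, 0, +1} plus a scale (gamma)."""
    gamma = max(abs(w) for w in row)      # absmax scale for this channel
    if gamma == 0.0:
        return [0] * len(row), 0.0
    half = 0.5 * gamma                    # assumed rounding threshold
    ternary = [1 if w > half else -1 if w < -half else 0 for w in row]
    return ternary, gamma

def ternary_matvec(w_ternary, gammas, x, bias=None):
    """y = gamma * (W_ternary @ x) + bias, with no multiply in the inner loop."""
    out = []
    for j, (row, gamma) in enumerate(zip(w_ternary, gammas)):
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
        y = gamma * acc                   # one multiply per output channel
        out.append(y + bias[j] if bias is not None else y)
    return out

weights = [[0.8, -0.1, -0.6], [0.2, 0.9, -0.3]]
quantized = [quantize_row(row) for row in weights]
w_ternary = [t for t, _ in quantized]
gammas = [g for _, g in quantized]
y = ternary_matvec(w_ternary, gammas, [1.0, 2.0, 3.0])
print(w_ternary, gammas, y)
```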

	### Memory Efficiency

	- Theoretical: log₂(3) ≈ 1.58 bits per weight
	- Actual: 1.6 bits per weight (5 values per byte)
	- Efficiency: ~99.1% of the theoretical optimum (1.585 / 1.6)
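
Five ternary digits fit in one byte because 3^5 = 243 ≤ 256, while 3^6 = 729 does not. The sketch below shows one way to do the base-3 packing and computes the resulting bits per weight. The digit order (first weight in the least significant position) and the function names are assumptions for illustration; `packing.py` may use a different layout.

```python
import math

def pack_base3(trits):
    """Pack values from {-1, 0, +1} into bytes, 5 per byte."""
    packed = bytearray()
    for start in range(0, len(trits), 5):
        value = 0
        # Map -1/0/+1 to digits 0/1/2; first trit is the lowest digit.
        for t in reversed(trits[start:start + 5]):
            value = value * 3 + (t + 1)
        packed.append(value)              # at most 242, so it fits in a byte
    return bytes(packed)

def unpack_base3(packed, count):
    """Inverse of pack_base3; `count` drops the padding in the last byte."""
    trits = []
    for byte in packed:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]

trits = [1, 0, -1, -1, 1, 0, 1]
packed = pack_base3(trits)
restored = unpack_base3(packed, len(trits))

bits_per_weight = 8 / 5
efficiency = math.log2(3) / bits_per_weight
print(list(packed), restored, bits_per_weight, round(efficiency, 4))
```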

	## 📁 Project Structure

	```
	BitLinear/
	├── bitlinear/                       # Main package
	│   ├── layers.py                    # BitLinear and MultiTernaryLinear modules
	│   ├── functional.py                # Core functional implementations
	│   ├── quantization.py              # Ternary quantization utilities
	│   ├── packing.py                   # Base-3 packing for memory efficiency
	│   └── cpp/                         # C++/CUDA extensions
	│       ├── bitlinear.cpp            # PyBind11 bindings & CPU kernels
	│       └── bitlinear_kernel.cu      # CUDA GPU kernels
	├── tests/                           # Comprehensive test suite
	├── examples/                        # Usage examples
	│   ├── basic_usage.py               # Simple demonstrations
	│   └── transformer_example.py       # Transformer integration
	├── benchmarks/                      # Performance benchmarks
	│   ├── benchmark_memory.py          # Memory analysis
	│   └── benchmark_performance.py     # Speed comparison
	└── notebooks/                       # Interactive tutorials
	    └── demo.md                      # Step-by-step guide
	```

	## 🧪 Examples

	### Example 1: Basic Layer

	```python
	from bitlinear import BitLinear, estimate_memory_savings

	# Create layer
	layer = BitLinear(512, 1024)

	# Check memory savings
	stats = estimate_memory_savings(512, 1024)
	print(f"Compression: {stats['compression_ratio']:.1f}x") # ~19x
	```

	### Example 2: Transformer Conversion

	```python
	import torch.nn as nn
	from bitlinear import convert_linear_to_bitlinear

	# Original transformer
	model = nn.TransformerEncoderLayer(d_model=768, nhead=8, dim_feedforward=3072)

	# Convert to BitLinear
	model_bit = convert_linear_to_bitlinear(model)

	# Compare parameter memory (tensors registered as buffers are not counted)
	mem_original = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2
	mem_bitlinear = sum(p.numel() * p.element_size() for p in model_bit.parameters()) / 1024**2
	print(f"Memory: {mem_original:.2f} MB → {mem_bitlinear:.2f} MB")
	```

	Run complete examples:

	```bash
	python examples/basic_usage.py
	python examples/transformer_example.py
	```

	## 📈 Benchmarks

	Run benchmarks to see performance on your hardware:

	```bash
	# Memory compression analysis
	python benchmarks/benchmark_memory.py

	# Forward pass performance
	python benchmarks/benchmark_performance.py
	```

	## 🧪 Testing

	Comprehensive test suite with 60+ tests:

	```bash
	# Run all tests
	pytest tests/ -v

	# Run specific test modules
	pytest tests/test_quantization.py -v
	pytest tests/test_layers.py -v
	```

	## 🎓 Research Background

	This implementation is based on:

	- BitNet: [Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)
	- JMLR: [Ternary Representations of Neural Networks](https://jmlr.org/papers/volume26/24-2050/24-2050.pdf)

	### Key Innovations

	1. Ternary Quantization: Reduces weights to {-1, 0, +1}
	2. Absmax Scaling: Per-channel scaling for accuracy
	3. Greedy Decomposition: Multi-ternary for better approximation
	4. Base-3 Packing: Near-optimal memory compression
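
The greedy decomposition in item 3 can be read as repeatedly quantizing whatever error the previous components left behind, so that the weights are approximated by a sum of k scaled ternary vectors. The sketch below illustrates that idea for a single output channel. It is an assumption-laden illustration, not the package's `greedy_ternary_decomposition()`: the absmax scale, the half-scale rounding threshold, and all function names here are chosen for the example.

```python
import math

def ternarize(vec):
    """Return (ternary, gamma) with gamma = absmax and a half-scale threshold."""
    gamma = max(abs(v) for v in vec)
    if gamma == 0.0:
        return [0] * len(vec), 0.0
    half = 0.5 * gamma
    return [1 if v > half else -1 if v < -half else 0 for v in vec], gamma

def greedy_decompose(weights, k):
    """Fit k (gamma, ternary) components, each to the current residual."""
    residual = list(weights)
    components = []
    for _ in range(k):
        ternary, gamma = ternarize(residual)
        components.append((gamma, ternary))
        residual = [r - gamma * t for r, t in zip(residual, ternary)]
    return components

def reconstruct(components, n):
    out = [0.0] * n
    for gamma, ternary in components:
        out = [o + gamma * t for o, t in zip(out, ternary)]
    return out

def l2_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

weights = [0.9, -0.4, 0.1, -0.7, 0.3]
errors = []
for k in (1, 2, 3):
    approx = reconstruct(greedy_decompose(weights, k), len(weights))
    errors.append(l2_error(weights, approx))
print([round(e, 3) for e in errors])
```

Each extra component zeroes the largest remaining residual entry, so the error shrinks with k at the cost of storing k packed ternary tensors and k scales.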

	## 🛠️ Implementation Details

	### Python Baseline

	Pure PyTorch implementation for correctness and clarity:
	- `bitlinear_python()` - Reference ternary matmul
	- `greedy_ternary_decomposition()` - Multi-component quantization
	- Full gradient support for training

	### C++ Extensions

	Optimized CPU kernels with PyBind11:
	- Ternary-specific optimizations (no multiplications)
	- Efficient memory access patterns
	- Base-3 packing/unpacking

	### CUDA Kernels

	GPU-accelerated implementation:
	- Warp-level reductions using shuffle intrinsics
	- Shared memory tiling
	- Memory coalescing
	- Fused multi-ternary kernels

	## 🎯 Use Cases

	### Ideal For:

	- Edge Deployment: Mobile and embedded devices
	- Large Models: Billion-parameter models with memory constraints
	- Production Inference: Cost-effective serving at scale
	- Research: Exploring ultra-low-precision networks

	### Considerations:

	- Training: Best results with quantization-aware training (QAT)
	- Accuracy: Expect a typical drop of 3-5%, which is acceptable for many tasks
	- Speed: The pure-Python implementation may be slower; use the C++/CUDA kernels for production

	## 📚 Documentation

	- [BENCHMARKS.md](BENCHMARKS.md) - Detailed performance analysis
	- [MODEL_CARD.md](MODEL_CARD.md) - HuggingFace model card
	- [notebooks/demo.md](notebooks/demo.md) - Interactive tutorial
	- [read/IMPLEMENTATION_GUIDE.md](read/IMPLEMENTATION_GUIDE.md) - Implementation details (not yet released; can be released if needed. The pipeline is being extended to support future machine learning research)

	## 🤝 Contributing

	Contributions welcome! Areas for improvement:

	- AVX/AVX512 vectorization for CPU
	- Tensor Core utilization for CUDA
	- Additional quantization schemes
	- Training examples and tutorials

	## 📄 License

	MIT License - see [LICENSE](LICENSE) file for details.

	## 📖 Citation

	If you use BitLinear in your research, please cite:

	```bibtex
	@article{jmlr_ternary_2024,
	title={Ternary Representations of Neural Networks},
	journal={Journal of Machine Learning Research},
	volume={26},
	year={2025},
	url={https://jmlr.org/papers/volume26/24-2050/24-2050.pdf}
	}

	@article{bitnet2023,
	title={BitNet: Scaling 1-bit Transformers for Large Language Models},
	author={Wang, Hongyu and Ma, Shuming and Dong, Li and Huang, Shaohan and Wang, Huaijie and Ma, Lingxiao and Yang, Fan and Wang, Ruiping and Wu, Yi and Wei, Furu},
	journal={arXiv preprint arXiv:2310.11453},
	year={2023}
	}
	```

	## 🌟 Acknowledgments

	This implementation builds upon the groundbreaking work in:
	- BitNet by Microsoft Research
	- Ternary Neural Networks research (JMLR)
	- PyTorch's extensibility framework

	## 📞 Contact

	For questions, issues, or collaboration:
	- Open an issue on GitHub
	- Check existing documentation
	- Review examples and benchmarks

	---

	Please tag me if you use this in anything you build. I would love to see what you build with it.

	Made with ❤️ for efficient deep learning