---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-14B-Instruct
tags:
- bitnet
- quantization
- ternary
- 1.58-bit
- qwen
- qwen2.5
- code
- experimental
- 14b-architecture
library_name: safetensors
pipeline_tag: text-generation
language:
- en
- zh
model_name: Qwen2.5-Coder-14B-BitNet-1.58b
datasets: []
metrics: []
---

# Qwen2.5-Coder-14B-Instruct-BitNet-1.58b

**Architecture: 14.7 Billion Parameters** | BitNet 1.58-bit Ternary Quantization

---

> **IMPORTANT: Parameter Count Display**
>
> Hugging Face displays a reduced parameter count because it counts packed bytes, not actual parameters.
> This model has the **full 14.7B parameter Qwen2.5-Coder architecture**.
> The weights are stored as ternary values ({-1, 0, +1}) packed 4 per byte, which reduces
> storage to 4.6 GB but preserves all 14.7 billion parameters.

---

## Overview

This is an **experimental** BitNet 1.58-bit quantization of the Qwen2.5-Coder-14B-Instruct model using absmean scaling with group-wise quantization. The model stores weights as ternary values ({-1, 0, +1}) packed 4 values per byte.

**This is research/experimental work. Quality and performance have not been formally benchmarked.**

## Specifications

| Property | Value |
|----------|-------|
| Base Model | [Qwen/Qwen2.5-Coder-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct) |
| Architecture | Qwen2 (Qwen2ForCausalLM) |
| Parameters | 14.7B |
| Quantization | BitNet 1.58-bit ternary |
| Bits per Weight | ~1.58 (information content); 2 as packed, 4 values per byte |
| Group Size | 64 |
| Original Size | 29.54 GB (BF16) |
| Quantized Size | 4.62 GB (SafeTensors) |
| GGUF Size | 6.52 GB (TQ2_0) |
| Compression | ~6.4x (BF16 to SafeTensors) |

## Formats

| Format | File | Description |
|--------|------|-------------|
| SafeTensors | `model-*.safetensors` | Sharded quantized weights + scales |
| GGUF | `qwen-coder-14b-tq2.gguf` | llama.cpp compatible |

## Quantization Method

### Algorithm

1. Reshape weights into groups of 64
2. Compute per-group scale: `scale = mean(|weights|)`
3. Normalize and round to nearest ternary: `q = round(w / scale)` clamped to {-1, 0, +1}
4. Map to unsigned: {-1, 0, +1} → {0, 1, 2}
5. Pack 4 values per byte: `v0 + v1*3 + v2*9 + v3*27`

### Tooling

- **Quantization**: Custom Rust tool using [Candle](https://github.com/huggingface/candle)
- **GGUF Conversion**: [llama.cpp](https://github.com/ggerganov/llama.cpp) convert_hf_to_gguf.py

### Hardware Used

- GPU: NVIDIA RTX 5080 (16GB VRAM)
- Quantization time: ~99 seconds
- Memory: Streaming mode with CPU fallback for large tensors

## Usage

### With Ollama/llama.cpp

```bash
# llama.cpp
./llama-cli -m qwen-coder-14b-tq2.gguf -p "Write a Python function:"
```

### Unpacking Weights (Python)

```python
def unpack_ternary(packed_byte):
    """Unpack 4 ternary values from one base-3 packed byte (v0 first)."""
    values = []
    val = packed_byte
    for _ in range(4):
        values.append((val % 3) - 1)  # {0,1,2} → {-1,0,+1}
        val //= 3
    return values
```

## Limitations

- **Quality not benchmarked** - May have significant degradation vs original
- **Requires custom runtime** - Standard `transformers` does not support ternary weights
- **Experimental** - Not intended for production use without evaluation
- GGUF keeps embeddings/lm_head at F16, hence larger than SafeTensors

## License

Apache 2.0 (inherited from [Qwen2.5-Coder-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct))

## Citation

```bibtex
@misc{qwen-coder-bitnet-2025,
  title={Qwen2.5-Coder-14B-BitNet-1.58b: Experimental BitNet Quantization},
  author={Tzervas},
  year={2025},
  url={https://huggingface.co/tzervas/qwen2.5-coder-14b-bitnet-1.58b}
}
```
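
## Appendix: Quantize and Pack Sketch (Python)

The five steps listed under Algorithm can be sketched in pure Python as follows. This is an illustrative sketch, not the Rust/Candle tool that produced these weights. The function names (`quantize_group`, `pack_ternary`, `unpack_all`) are hypothetical, the zero-scale handling is an assumption, and Python's `round` breaks ties to even, which may differ from the actual tool.

```python
from typing import List, Tuple

GROUP_SIZE = 64  # group size from the Specifications table


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Steps 2-3: absmean scale, then round and clamp to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        # Assumption: an all-zero group quantizes to all zeros.
        return [0] * len(weights), 0.0
    # Note: Python's round() uses round-half-to-even on exact ties.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def pack_ternary(values: List[int]) -> bytes:
    """Steps 4-5: map to {0, 1, 2} and pack 4 values per byte in base 3."""
    if len(values) % 4 != 0:
        raise ValueError("number of values must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(values), 4):
        v0, v1, v2, v3 = (v + 1 for v in values[i:i + 4])
        out.append(v0 + v1 * 3 + v2 * 9 + v3 * 27)  # max is 80, fits in a byte
    return bytes(out)


def unpack_all(packed: bytes) -> List[int]:
    """Inverse of pack_ternary: recover ternary values, v0 first."""
    values = []
    for byte in packed:
        for _ in range(4):
            values.append(byte % 3 - 1)
            byte //= 3
    return values


# Round trip on one synthetic group of 64 weights (step 1 is the grouping).
weights = [0.9, -0.05, -1.2, 0.3] * (GROUP_SIZE // 4)
ternary, scale = quantize_group(weights)
packed = pack_ternary(ternary)
assert unpack_all(packed) == ternary
assert len(packed) == GROUP_SIZE // 4  # 64 weights stored in 16 bytes
```

Dequantization under this scheme would multiply each recovered ternary value by its group's `scale`. Because four base-3 digits reach at most 80, each byte carries 2 bits per weight as stored, which is why the packed size is above the 1.58-bit theoretical figure.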