---
title: BERT Compression POC
emoji: 🚀
colorFrom: blue
colorTo: gray
sdk: gradio
sdk_version: "5.0.0"
python_version: "3.10.13"
app_file: app.py
pinned: false
---

# 🔬 BERT Compression POC

A hands-on demonstration of **ML model compression** applied to a pretrained DistilBERT sentiment classifier. It covers three core compression concepts: **quantization**, **dynamic vs static quantization**, and the **accuracy-efficiency trade-off**, with benchmarks comparing the FP32 baseline against the INT8 quantized model on accuracy, inference speed, and model size.

---

## What This Project Does

Takes a pretrained DistilBERT model fine-tuned on SST-2 sentiment analysis, applies **Dynamic INT8 Post-Training Quantization** using PyTorch, and measures the real-world impact across three dimensions:

| Metric | Baseline (FP32) | Quantized (INT8) |
|---|---|---|
| Model Size | ~255 MB | ~65 MB |
| Avg Inference Time | ~120 ms | ~40 ms |
| Accuracy (SST-2) | ~91% | ~90.3% |
| Speedup | n/a | ~3x faster |
| Size Reduction | n/a | ~75% smaller |

> *Actual numbers will vary with hardware. Run `python src/compare.py` to see your results.*

---

## Concepts Demonstrated

**Quantization**: Reducing model weight precision from FP32 (32-bit float) to INT8 (8-bit integer), which takes four times fewer bytes per quantized parameter. INT8 models run faster on hardware with integer arithmetic units (most modern CPUs and mobile NPUs).

**Dynamic vs Static Quantization**: This POC uses dynamic quantization, a standard approach for transformer/NLP models. Weights are quantized ahead of time; activations are quantized on the fly at runtime, so no calibration dataset is required. Static quantization, by contrast, also fixes the activation ranges ahead of time using a calibration dataset.

**Accuracy-Efficiency Trade-off**: Demonstrates that aggressive compression (roughly 4x size reduction) costs less than 1 percentage point of accuracy, a core insight from the model compression literature.
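
The FP32-to-INT8 mapping described above can be sketched in plain Python. This is a minimal illustration, assuming a per-tensor affine (asymmetric) scheme over signed 8-bit integers. The helper names `quantize_int8` and `dequantize` are hypothetical and are not part of this repo or of PyTorch, which performs the equivalent steps internally with optimized kernels.

```python
from array import array


def quantize_int8(weights):
    """Map floats to int8 with a per-tensor affine scheme (illustrative only)."""
    lo, hi = min(weights), max(weights)
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale)
    # Round each weight to the nearest step, shift by the zero point, clamp to int8.
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the int8 codes."""
    return [(v - zero_point) * scale for v in q]


# Toy weight tensor (made-up values, not taken from DistilBERT).
weights = [-0.82, -0.11, 0.0, 0.27, 0.64, 1.23]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

# Storage cost: 4 bytes per FP32 value versus 1 byte per INT8 value.
fp32_bytes = len(array("f", weights).tobytes())
int8_bytes = len(array("b", q).tobytes())
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes")
print(f"max round-trip error: {max_error:.5f} (scale = {scale:.5f})")
```

In this toy tensor every weight round-trips to within half a quantization step (`scale / 2`), which gives some intuition for why the accuracy cost of INT8 weights can stay small. Real models have much wider and less uniform weight distributions, so treat this as a sketch of the arithmetic rather than a predictor of accuracy.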

---

## Project Structure

```
bert-compression-poc/
├── src/
│   ├── load_model.py      # Load DistilBERT from HuggingFace
│   ├── evaluate.py        # Accuracy + inference time measurement
│   ├── quantize.py        # INT8 quantization pipeline
│   └── compare.py         # Side-by-side results table
├── models/
│   ├── baseline/          # Saved FP32 model
│   └── quantized/         # Saved INT8 model
├── results/
│   └── comparison.json    # Benchmark output
├── notebook/
│   └── demo.ipynb         # Full walkthrough notebook
├── app.py                 # Gradio UI: live comparison
├── requirements.txt
└── README.md
```

---

## Setup

**Requirements:** Python 3.10+, pip

```bash
# Clone the repo
git clone https://github.com/YOUR_USERNAME/bert-compression-poc.git
cd bert-compression-poc

# Install dependencies
pip install -r requirements.txt
```

---

## How to Run

**Step 1: Download and save the baseline model**

```bash
python src/load_model.py
```

**Step 2: Apply INT8 quantization**

```bash
python src/quantize.py
```

**Step 3: Run the benchmark comparison**

```bash
python src/compare.py
```

**Step 4: Launch the Gradio UI**

```bash
python app.py
# Open http://localhost:7860
```

**Or: open the notebook for a full walkthrough**

```bash
jupyter notebook notebook/demo.ipynb
```

---

## Key Findings

- Dynamic INT8 quantization achieves **~3x inference speedup** on CPU with less than **1% accuracy drop**
- Model size reduces from ~255 MB to ~65 MB, a **~75% reduction**
- No retraining required: compression is applied post-training in under 30 seconds
- PyTorch's `quantize_dynamic` supports this natively with 3 lines of code

---

## Tech Stack

- **PyTorch**: quantization engine
- **HuggingFace Transformers**: DistilBERT model and tokenizer
- **HuggingFace Datasets**: SST-2 evaluation data
- **Gradio**: interactive comparison UI

---

## Further Reading

- Hinton et al. (2015), Knowledge Distillation: https://arxiv.org/abs/1503.02531
- Jacob et al. (2018), Quantization for efficient inference: https://arxiv.org/abs/1712.05877
- Sanh et al. (2019), DistilBERT: https://arxiv.org/abs/1910.01108
- PyTorch Quantization Docs: https://pytorch.org/docs/stable/quantization.html

---

*Built as part of MCA (AI/ML Specialisation) minor project research on ML model compression for edge deployment.*