---
title: BERT Compression POC
emoji: 🚀
colorFrom: blue
colorTo: gray
sdk: gradio
sdk_version: "5.0.0"
python_version: "3.10.13"
app_file: app.py
pinned: false
---

# 🔬 BERT Compression POC

A hands-on demonstration of **ML model compression** applied to a pretrained DistilBERT sentiment classifier.

Covers three core compression concepts: **quantization**, **dynamic vs static quantization**, and the **accuracy-efficiency trade-off**, with benchmarks comparing the FP32 baseline against the INT8 quantized model on accuracy, inference speed, and model size.

---

## What This Project Does

Takes a pretrained DistilBERT model fine-tuned on SST-2 sentiment analysis, applies **Dynamic INT8 Post-Training Quantization** using PyTorch, and measures the real-world impact across three dimensions:

| Metric | Baseline (FP32) | Quantized (INT8) |
|---|---|---|
| Model Size | ~255 MB | ~65 MB |
| Avg Inference Time | ~120 ms | ~40 ms |
| Accuracy (SST-2) | ~91% | ~90.3% |
| Speedup | — | ~3x faster |
| Size Reduction | — | ~75% smaller |

> *These figures are approximate; inference time in particular depends on your hardware. Run `python src/compare.py` to see your results.*
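
The two derived rows (speedup and size reduction) follow directly from the approximate values in the table. The sketch below recomputes them as a sanity check; the dictionary keys and the `derived_metrics` helper are illustrative names and are not taken from the repo.

```python
# Approximate values copied from the results table (not fresh measurements).
baseline = {"size_mb": 255.0, "latency_ms": 120.0, "accuracy": 0.91}
quantized = {"size_mb": 65.0, "latency_ms": 40.0, "accuracy": 0.903}


def derived_metrics(base, quant):
    """Return speedup, relative size reduction, and accuracy drop in points."""
    return {
        "speedup": base["latency_ms"] / quant["latency_ms"],
        "size_reduction": 1.0 - quant["size_mb"] / base["size_mb"],
        "accuracy_drop_pts": (base["accuracy"] - quant["accuracy"]) * 100.0,
    }


metrics = derived_metrics(baseline, quantized)
print(f"Speedup:        {metrics['speedup']:.1f}x")
print(f"Size reduction: {metrics['size_reduction']:.1%}")
print(f"Accuracy drop:  {metrics['accuracy_drop_pts']:.1f} points")
```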

---

## Concepts Demonstrated

**Quantization**: Reducing model weight precision from FP32 (32-bit float) to INT8 (8-bit integer). Each quantized parameter takes one byte instead of four, and INT8 models run faster on hardware with integer arithmetic units (most modern CPUs and mobile NPUs).
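
To make the FP32-to-INT8 mapping concrete, here is a minimal pure-Python sketch of affine (scale plus zero-point) quantization on a handful of made-up weights. It is a simplified illustration of the idea, not the exact scheme PyTorch's backends use; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
import struct


def quantize_int8(values):
    """Map floats to signed 8-bit integers using a scale and a zero-point."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # keep 0.0 in range
    scale = (hi - lo) / 255.0 or 1.0       # 256 integer levels span [lo, hi]
    zero_point = round(-128 - lo / scale)  # integer that stands for 0.0
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the INT8 codes."""
    return [(x - zero_point) * scale for x in q]


weights = [-0.82, -0.11, 0.0, 0.37, 1.25]  # made-up example weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: 4 bytes per FP32 value versus 1 byte per INT8 value.
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(struct.pack(f"{len(q)}b", *q))

print(q, f"max round-trip error = {max_err:.5f}")
print(f"{fp32_bytes} bytes as FP32, {int8_bytes} bytes as INT8")
```

The round-trip error stays within about half a quantization step here, which is the intuition behind the small accuracy loss reported in the results table.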

**Dynamic vs Static Quantization**: This POC uses dynamic quantization, a common choice for transformer/NLP models. Weights are quantized ahead of time; activations are quantized on the fly at inference time, so no calibration dataset is required. Static quantization, by contrast, fixes activation ranges in advance using a calibration pass.
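
As a rough illustration of dynamic quantization in PyTorch, the sketch below applies `torch.quantization.quantize_dynamic` to a toy feed-forward block rather than to DistilBERT itself, so it runs without downloading a model. The layer sizes and the `serialized_bytes` helper are assumptions made for this example, and it presumes a CPU build of PyTorch with a quantized backend available.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer feed-forward block (sizes chosen for illustration).
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768)).eval()

# Replace every nn.Linear with a dynamically quantized layer holding INT8 weights.
# The original model is left untouched because inplace defaults to False.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)


def serialized_bytes(module):
    """Size of the module's state_dict when saved to an in-memory buffer."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes


with torch.no_grad():
    x = torch.randn(4, 768)
    fp32_out = model(x)
    int8_out = quantized(x)  # activations are quantized on the fly here

fp32_size = serialized_bytes(model)
int8_size = serialized_bytes(quantized)
print(f"FP32: {fp32_size / 1e6:.1f} MB, INT8: {int8_size / 1e6:.1f} MB")
print(f"Max output difference: {(fp32_out - int8_out).abs().max().item():.4f}")
```

Only the `nn.Linear` layers are converted in this sketch; any module type not listed in the set passed to `quantize_dynamic` keeps its FP32 weights.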

**Accuracy-Efficiency Trade-off**: Demonstrates that aggressive compression (roughly 4x size reduction) costs less than 1 percentage point of accuracy, a central observation in the model compression literature.

---

## Project Structure

```
bert-compression-poc/
├── src/
│   ├── load_model.py     # Load DistilBERT from HuggingFace
│   ├── evaluate.py       # Accuracy + inference time measurement
│   ├── quantize.py       # INT8 quantization pipeline
│   └── compare.py        # Side-by-side results table
├── models/
│   ├── baseline/         # Saved FP32 model
│   └── quantized/        # Saved INT8 model
├── results/
│   └── comparison.json   # Benchmark output
├── notebook/
│   └── demo.ipynb        # Full walkthrough notebook
├── app.py                # Gradio UI — live comparison
├── requirements.txt
└── README.md
```

---

## Setup

**Requirements:** Python 3.10+, pip

```bash
# Clone the repo
git clone https://github.com/YOUR_USERNAME/bert-compression-poc.git
cd bert-compression-poc

# Install dependencies
pip install -r requirements.txt
```

---

## How to Run

**Step 1 — Download and save the baseline model**
```bash
python src/load_model.py
```

**Step 2 — Apply INT8 quantization**
```bash
python src/quantize.py
```

**Step 3 — Run the benchmark comparison**
```bash
python src/compare.py
```

**Step 4 — Launch the Gradio UI**
```bash
python app.py
# Open http://localhost:7860
```

**Or — open the notebook for a full walkthrough**
```bash
jupyter notebook notebook/demo.ipynb
```
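
The benchmark in Step 3 reports an average inference time. One common way to measure that is to discard a few warm-up calls and then average wall-clock timings, as sketched below. This is an assumed approach, not the actual contents of `src/evaluate.py`; `mean_latency_ms` and `fake_model` are hypothetical names, with `fake_model` standing in for a real forward pass.

```python
import statistics
import time


def mean_latency_ms(fn, inputs, warmup=3, repeats=10):
    """Average wall-clock time of fn(x) per call, in milliseconds."""
    for x in inputs[:warmup]:  # warm-up calls are executed but not timed
        fn(x)
    timings = []
    for _ in range(repeats):
        for x in inputs:
            start = time.perf_counter()
            fn(x)
            timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


def fake_model(text):
    """Hypothetical stand-in for a model call; returns a dummy 0/1 label."""
    return sum(ord(c) for c in text) % 2


samples = ["a great movie", "dull and far too long", "surprisingly moving"]
latency = mean_latency_ms(fake_model, samples)
print(f"{latency:.4f} ms per call")
```

Passing the baseline and quantized models through the same helper with identical inputs keeps the comparison fair.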

---

## Key Findings

- Dynamic INT8 quantization achieves **~3x inference speedup** on CPU with less than **1% accuracy drop**
- Model size reduces from ~255 MB to ~65 MB — a **~75% reduction**
- No retraining required — compression is applied post-training in under 30 seconds
- PyTorch's `torch.quantization.quantize_dynamic` supports this natively in about 3 lines of code

---

## Tech Stack

- **PyTorch** — quantization engine
- **HuggingFace Transformers** — DistilBERT model and tokenizer
- **HuggingFace Datasets** — SST-2 evaluation data
- **Gradio** — interactive comparison UI

---

## Further Reading

- Hinton et al. (2015) — Knowledge Distillation: https://arxiv.org/abs/1503.02531
- Jacob et al. (2018) — Quantization for efficient inference: https://arxiv.org/abs/1712.05877
- Sanh et al. (2019) — DistilBERT: https://arxiv.org/abs/1910.01108
- PyTorch Quantization Docs: https://pytorch.org/docs/stable/quantization.html

---

*Built as part of MCA (AI/ML Specialisation) minor project research on ML model compression for edge deployment.*