---
language:
- en
license: llama3.1
tags:
- llama
- llama-3
- llama-3.1
- instruct
- int8
- quantized
- vllm
- llm-compressor
- w8a8
- per-token
- per-channel
- dynamic-quantization
base_model: meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token

This is an INT8 W8A8 quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) created using [llm-compressor](https://github.com/vllm-project/llm-compressor).

**Note**: Only the **weights and activations** are quantized. The KV cache is **not** quantized.

## Quantization Details

- **Quantization Method**: INT8 W8A8 (Weight and Activation only)
- **Weight Precision**: INT8 (8-bit integer), static per-channel quantization
  - Scale shape: `(N, 1)` — one scale per output channel
  - Observer: MinMax
- **Activation Precision**: INT8 (8-bit integer), dynamic per-token quantization
  - Scale computed at runtime: `absmax / 127.0` per token (row)
  - No activation scales stored in checkpoint
- **KV Cache**: **Not quantized** (remains in original precision)
- **Quantization Format**: `compressed-tensors` (`int-quantized`)
- **Ignored Layers**: `lm_head` only
- **Calibration Dataset**: CNN/DailyMail
- **Calibration Samples**: 512
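
The two schemes listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the llm-compressor or vLLM implementation: the helper name `quantize_rows_int8` is hypothetical, both weight and activation scales use `absmax / 127.0`, and the rounding and clamping details of the real libraries may differ.

```python
def quantize_rows_int8(matrix):
    """Symmetric INT8 quantization with one scale per row (hypothetical helper).

    For a weight matrix of shape (N, K), a row is an output channel, so this
    yields N scales, i.e. the (N, 1) scale tensor stored in the checkpoint.
    For activations of shape (M, K), a row is a token, so this yields M scales
    that are recomputed for every forward pass.
    """
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0  # guard all-zero rows
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


# Static per-channel weights: quantized once, offline (N=2 channels, K=3).
w = [[0.1, 0.2, -0.3], [-0.05, 0.4, 0.0]]
w_q, w_scales = quantize_rows_int8(w)

# Dynamic per-token activations: quantized at runtime (M=2 tokens, K=3).
x = [[0.5, -1.0, 0.25], [2.0, 4.0, -8.0]]
x_q, x_scales = quantize_rows_int8(x)

print(w_q, w_scales)
print(x_q, x_scales)
```

The same row-wise routine serves both cases. What differs is when it runs: weight scales are computed once and saved, while activation scales depend on the input and are never stored.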

## vLLM CUTLASS W8A8 Kernel

This model is optimized for the vLLM CUTLASS W8A8 INT8 kernel, which fuses dequantization into the GEMM epilogue:

```
D[m,n] = a_scale[m] * b_scale[n] * int32_accum[m,n]
```

- `a_scale[m]`: per-token activation scale (computed dynamically at runtime)
- `b_scale[n]`: per-channel weight scale (stored in checkpoint)
- `int32_accum[m,n]`: INT8 x INT8 accumulated in INT32
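
The epilogue formula can be checked numerically with a small reference in plain Python. This sketch only mirrors the arithmetic of the formula above. It is not the CUTLASS kernel, the function names are hypothetical, and weights are assumed to be laid out as `(N, K)` with one row per output channel.

```python
def quantize_rows_int8(matrix):
    """Symmetric INT8 quantization, one scale per row (hypothetical helper)."""
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w):
    """Reference W8A8 linear layer: x is (M, K), w is (N, K), result is (M, N)."""
    x_q, a_scale = quantize_rows_int8(x)  # per token, computed at runtime
    w_q, b_scale = quantize_rows_int8(w)  # per channel, precomputed in practice
    out = []
    for m, x_row in enumerate(x_q):
        out_row = []
        for n, w_row in enumerate(w_q):
            # INT8 x INT8 products accumulated as integers (INT32 on the GPU).
            acc = sum(a * b for a, b in zip(x_row, w_row))
            # Epilogue: D[m,n] = a_scale[m] * b_scale[n] * int32_accum[m,n]
            out_row.append(a_scale[m] * b_scale[n] * acc)
        out.append(out_row)
    return out


x = [[0.5, -1.0, 0.25], [2.0, 4.0, -8.0]]
w = [[0.1, 0.2, -0.3], [-0.05, 0.4, 0.0]]

y = w8a8_linear(x, w)
# Full-precision reference for comparison.
ref = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]

print(y)
print(ref)
```

Because both scales factor out of the sum over `K`, a single multiply per output element is enough to dequantize, which is why the step can be fused into the GEMM epilogue.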

## Model Size

- **Original Model**: ~16GB (FP16)
- **Quantized Model**: ~8.5GB (INT8 W8A8)
- **Compression Ratio**: ~1.9x

## Usage

### Installation

```bash
pip install "vllm>=0.6.0"
```

### With vLLM

```python
from vllm import LLM, SamplingParams

# Load the INT8 W8A8 quantized model
llm = LLM(
    model="JongYeop/Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token",
)

# Generate text
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### With Transformers (for inspection)

```python
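# Loading compressed-tensors checkpoints in Transformers needs the
# `compressed-tensors` package (pip install compressed-tensors).
from transformers import AutoTokenizer, AutoModelForCausalLM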

tokenizer = AutoTokenizer.from_pretrained("JongYeop/Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token")
model = AutoModelForCausalLM.from_pretrained(
    "JongYeop/Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

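# add_generation_prompt=True appends the assistant header so the model answers
# instead of continuing the user turn.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)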
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Performance

INT8 W8A8 Dynamic Per-Token quantization provides:
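- **~1.9x smaller checkpoint** than FP16 (~8.5GB vs ~16GB); the KV cache is not quantized, so its memory use is unchanged
- **Faster inference** on GPUs with INT8 Tensor Cores, where the quantized linear layers run as INT8 GEMMs
- **Better accuracy than per-tensor activation quantization**, because each token gets its own scale and a large-magnitude token does not reduce the resolution available to the others
- **Per-channel weight quantization**, which gives each output channel its own scale so that channels with small weights are not rounded against the largest channel's range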
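
The accuracy argument for per-token scaling can be illustrated with a toy round trip in plain Python. This is a sketch with made-up activation values and hypothetical function names, not a measurement on this model: it quantizes two tokens of very different magnitude, once with a single shared scale and once with one scale per token, and compares the worst-case reconstruction error per token.

```python
def quantize_dequantize(values, scale):
    """Round to the INT8 grid defined by `scale`, then map back to floats."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def per_tensor_roundtrip(x):
    # One scale for the whole activation matrix.
    scale = max(abs(v) for row in x for v in row) / 127.0 or 1.0
    return [quantize_dequantize(row, scale) for row in x]


def per_token_roundtrip(x):
    # One scale per row (token).
    out = []
    for row in x:
        scale = max(abs(v) for v in row) / 127.0 or 1.0
        out.append(quantize_dequantize(row, scale))
    return out


def max_error_per_row(x, x_hat):
    return [max(abs(a - b) for a, b in zip(r, rh)) for r, rh in zip(x, x_hat)]


# Toy activations: a small-magnitude token next to a large-magnitude one.
x = [[0.01, -0.02, 0.03], [10.0, -20.0, 30.0]]

tensor_err = max_error_per_row(x, per_tensor_roundtrip(x))
token_err = max_error_per_row(x, per_token_roundtrip(x))

print("per-tensor max error per token:", tensor_err)
print("per-token  max error per token:", token_err)
```

With a shared scale, the small-magnitude token is rounded entirely to zero, while the large token is reconstructed equally well in both cases. Real activations are less extreme than this toy input, so the gap on an actual model will be smaller.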

## Quantization Recipe

The quantization recipe used for this model is included in the repository as `recipe.yaml`.

Key configuration:
```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 8
            type: int
            strategy: channel     # Per-channel (one scale per output channel)
            dynamic: false
            symmetric: true
          input_activations:
            num_bits: 8
            type: int
            strategy: token       # Per-token (one scale per row)
            dynamic: true         # Scales computed at runtime
            symmetric: true
          targets: ["Linear"]
```

## Hardware Requirements

- **GPU**: NVIDIA GPU with INT8 Tensor Core support (Turing and later)
  - Examples: RTX 2080 Ti, RTX 3090, RTX 4090, A100, H100
- **VRAM**: Minimum 10GB for inference

## Citation

If you use this model, please cite:

```bibtex
@software{llm-compressor,
  title = {LLM Compressor},
  author = {vLLM Team},
  url = {https://github.com/vllm-project/llm-compressor},
  year = {2024}
}

@article{llama3,
  title={Llama 3 Model Card},
  author={AI@Meta},
  year={2024},
  url={https://github.com/meta-llama/llama3}
}
```

## License

This model inherits the Llama 3.1 Community License from the original [Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model.

## Acknowledgments

- Original model: [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- Quantization tool: [llm-compressor](https://github.com/vllm-project/llm-compressor) by the vLLM team
- Quantization guide: [vLLM INT8 W8A8 Documentation](https://docs.vllm.ai/projects/llm-compressor/en/latest/examples/quantization_w8a8_int8/)