---
language:
- en
license: llama3.1
tags:
- llama
- llama-3
- llama-3.1
- instruct
- int8
- quantized
- vllm
- llm-compressor
- w8a8
- per-token
- per-channel
- dynamic-quantization
base_model: meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token

This is an INT8 W8A8 quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) created using [llm-compressor](https://github.com/vllm-project/llm-compressor).

**Note**: This model quantizes **Weights and Activations only**. KV cache is **NOT** quantized.

## Quantization Details

- **Quantization Method**: INT8 W8A8 (Weight and Activation only)
- **Weight Precision**: INT8 (8-bit integer), static per-channel quantization
  - Scale shape: `(N, 1)`, one scale per output channel
  - Observer: MinMax
- **Activation Precision**: INT8 (8-bit integer), dynamic per-token quantization
  - Scale computed at runtime: `absmax / 127.0` per token (row)
  - No activation scales stored in checkpoint
- **KV Cache**: **Not quantized** (remains in original precision)
- **Quantization Format**: `compressed-tensors` (`int-quantized`)
- **Ignored Layers**: `lm_head` only
- **Calibration Dataset**: CNN/DailyMail
- **Calibration Samples**: 512

## vLLM CUTLASS W8A8 Kernel

This model is optimized for the vLLM CUTLASS W8A8 INT8 kernel, which fuses dequantization into the GEMM epilogue:

```
D[m,n] = a_scale[m] * b_scale[n] * int32_accum[m,n]
```

- `a_scale[m]`: per-token activation scale (computed dynamically at runtime)
- `b_scale[n]`: per-channel weight scale (stored in checkpoint)
- `int32_accum[m,n]`: INT8 x INT8 accumulated in INT32

## Model Size

- **Original Model**: ~16GB (FP16)
- **Quantized Model**: ~8.5GB (INT8 W8A8)
- **Compression Ratio**: ~1.9x

## Usage

### Installation

```bash
# Quote the requirement so the shell does not treat ">=" as a redirect
pip install "vllm>=0.6.0"
```

### With vLLM

```python
from vllm import LLM, SamplingParams

# Load the INT8 W8A8 quantized model
llm = LLM(
    model="JongYeop/Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token",
)

# Generate text
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### With Transformers (for inspection)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "JongYeop/Llama-3.1-8B-Instruct-INT8-W8A8-Dynamic-Per-Token"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# add_generation_prompt=True appends the assistant header so the model
# answers the user instead of continuing the user's turn
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Performance

INT8 W8A8 Dynamic Per-Token quantization provides:

- **~2x weight memory reduction** compared to FP16 (~16GB to ~8.5GB); the KV cache is not quantized, so its footprint is unchanged
- **Faster inference** with INT8 GEMM kernels on GPUs with INT8 Tensor Cores
- **Better accuracy than per-tensor activation quantization** due to fine-grained per-token activation scaling
- **Per-channel weight quantization** preserves weight distribution per output channel

## Quantization Recipe

The quantization recipe used for this model is included in the repository as `recipe.yaml`.

Key configuration:

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 8
            type: int
            strategy: channel  # Per-channel (one scale per output channel)
            dynamic: false
            symmetric: true
          input_activations:
            num_bits: 8
            type: int
            strategy: token  # Per-token (one scale per row)
            dynamic: true  # Scales computed at runtime
            symmetric: true
          targets: ["Linear"]
```

## Hardware Requirements

- **GPU**: NVIDIA GPU with INT8 Tensor Core support (Turing and later)
  - Examples: RTX 2080 Ti, RTX 3090, RTX 4090, A100, H100
- **VRAM**: Minimum 10GB for inference

## Citation

If you use this model, please cite:

```bibtex
@software{llm-compressor,
  title = {LLM Compressor},
  author = {vLLM Team},
  url = {https://github.com/vllm-project/llm-compressor},
  year = {2024}
}

@article{llama3,
  title = {Llama 3 Model Card},
  author = {AI@Meta},
  year = {2024},
  url = {https://github.com/meta-llama/llama3}
}
```

## License

This model inherits the license from the original [Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model.

## Acknowledgments

- Original model: [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- Quantization tool: [llm-compressor](https://github.com/vllm-project/llm-compressor) by vLLM team
- Quantization guide: [vLLM INT8 W8A8 Documentation](https://docs.vllm.ai/projects/llm-compressor/en/latest/examples/quantization_w8a8_int8/)
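
## Appendix: W8A8 Arithmetic Sketch

The epilogue formula from the vLLM CUTLASS W8A8 Kernel section, `D[m,n] = a_scale[m] * b_scale[n] * int32_accum[m,n]`, can be checked by hand on a tiny example. The following is a minimal, dependency-free sketch of that arithmetic, not the vLLM kernel and not the llm-compressor implementation. The helper names `quantize_rows` and `w8a8_linear` are hypothetical, and the sketch assumes weight scales are also computed as `absmax / 127.0`, which may differ slightly from what the MinMax observer produces.

```python
# Illustrative only: plain-Python reference arithmetic, not the CUTLASS kernel.
# quantize_rows and w8a8_linear are hypothetical helper names for this sketch.

def quantize_rows(matrix):
    """Symmetric INT8 quantization with one absmax / 127.0 scale per row."""
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0  # guard all-zero rows
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w_q, b_scale):
    """Compute x @ W^T from INT8 operands, dequantizing in the epilogue."""
    # Dynamic per-token quantization: scales are derived from the input itself.
    a_q, a_scale = quantize_rows(x)
    out = []
    for m, a_row in enumerate(a_q):
        out_row = []
        for n, w_row in enumerate(w_q):
            # Integer dot product (stands in for INT32 accumulation).
            acc = sum(a * b for a, b in zip(a_row, w_row))
            # Epilogue: D[m,n] = a_scale[m] * b_scale[n] * acc
            out_row.append(a_scale[m] * b_scale[n] * acc)
        out.append(out_row)
    return out


x = [[0.5, -1.0, 0.25], [4.0, 2.0, -8.0]]  # 2 tokens, hidden size K = 3
w = [[0.1, 0.2, -0.3], [-1.5, 0.5, 1.0]]   # N = 2 output channels, K = 3

# Static per-channel weight quantization: done once, scales kept with weights.
w_q, b_scale = quantize_rows(w)

y_int8 = w8a8_linear(x, w_q, b_scale)
y_ref = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
max_err = max(
    abs(p - q) for pr, qr in zip(y_int8, y_ref) for p, q in zip(pr, qr)
)
print("max abs error vs. float reference:", max_err)
```

Because the activation scales are recomputed from each input row, nothing about activations has to be calibrated or saved, which is consistent with the note that no activation scales are stored in the checkpoint. The second token in `x` has a much larger magnitude than the first, and a per-token scale lets each row use the full INT8 range independently.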