---
license: llama3.2
tags:
- tensorrt-llm
- nvfp4
- fp4
- kv-cache-quantization
- text-generation
- llama
base_model: meta-llama/Llama-3.2-3B-Instruct
---

# Llama-3.2-3B-Instruct TensorRT-LLM checkpoint (NVFP4 weight + FP8 KV)

TensorRT-LLM **checkpoint** for **Llama-3.2-3B-Instruct**, with **NVFP4 (W4A4)** weight quantization and **FP8** KV cache. Use with `trtllm-build` to produce an engine for inference.

## Model details

| Item | Value |
|------|--------|
| **Base model** | Llama-3.2-3B-Instruct |
| **Framework** | TensorRT-LLM (checkpoint format) |
| **Weight quantization** | NVFP4 (W4A4) |
| **KV cache** | FP8 |
| **Producer** | TensorRT-Model-Optimizer llm_ptq + TensorRT-LLM convert_checkpoint (--use_nvfp4, --fp8_kv_cache) |
| **Architecture** | LlamaForCausalLM (decoder-only) |

## Build (how to produce this checkpoint)

NVFP4 requires a two-step pipeline: (1) run Model Optimizer llm_ptq to quantize the Hugging Face model to NVFP4; (2) run TensorRT-LLM convert_checkpoint with the PTQ output to produce this checkpoint.

### 1. Environment and dependencies

```bash
sudo apt install git-lfs
git lfs install
pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com
# Install TensorRT-Model-Optimizer (required for NVFP4 quantization)
# See https://github.com/NVIDIA/TensorRT-Model-Optimizer
```

### 2. Quantize base model to NVFP4 (llm_ptq)

Clone the base model and run Model Optimizer's llm_ptq to produce an NVFP4-quantized HF-format directory. Then run TensorRT-LLM convert_checkpoint:

```bash
# Example: after llm_ptq has produced PTQ output (NVFP4 weights),
# run convert_checkpoint with that directory as --model_dir:
python TensorRT-LLM/examples/llama/convert_checkpoint.py \
  --model_dir ./path/to/ptq_output \
  --output_dir ./llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8 \
  --dtype float16 \
  --use_nvfp4 \
  --fp8_kv_cache
```

### 3. Output

After conversion, `--output_dir` contains `config.json` and `rank0.safetensors`; that is the checkpoint in this repo.
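
Before uploading or building an engine, it can help to confirm that the conversion actually wrote both files. The snippet below is a minimal sketch, not part of TensorRT-LLM: `check_ckpt` is a hypothetical helper name, and it only checks that `config.json` and `rank0.safetensors` exist in the given directory, not that their contents are valid.

```bash
# Hypothetical helper (not a TensorRT-LLM command): verify that a checkpoint
# directory contains the two files listed in the Output step.
check_ckpt() {
  local dir="$1" missing=0 f
  for f in config.json rank0.safetensors; do
    if [ ! -f "$dir/$f" ]; then
      # Report each missing file on stderr and remember the failure.
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Exit status 0 if both files are present, 1 otherwise.
  return "$missing"
}

# Usage (directory name taken from the convert_checkpoint example):
# check_ckpt ./llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8 && echo "checkpoint files present"
```

Because the function only sets an exit status, it can be chained in front of later steps with `&&`.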

## Upload (how to upload to Hugging Face)

```bash
cd ./llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8
huggingface-cli repo create rungalileo/llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8 --repo-type model
huggingface-cli upload rungalileo/llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8 . --repo-type model
```

## How to use

### 1. Build engine

Requires [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and `tensorrt_llm` installed:

```bash
git clone https://huggingface.co/rungalileo/llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8
cd llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8
trtllm-build --checkpoint_dir . --output_dir ./engine \
  --max_batch_size 1 --max_input_len 512 --max_seq_len 1024
```

### 2. Run inference

Use a tokenizer from the base model (e.g. `meta-llama/Llama-3.2-3B-Instruct`):

```bash
trtllm-serve ./engine --tokenizer meta-llama/Llama-3.2-3B-Instruct --port 8000
# OpenAI-compatible API: http://localhost:8000/v1/completions
```

## Files in this repo

- `config.json` – TensorRT-LLM model config
- `rank0.safetensors` – Rank 0 weights (single-GPU)

## References

- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
- [Llama 3.2](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)