---
license: llama3.2
tags:
  - tensorrt-llm
  - nvfp4
  - fp4
  - kv-cache-quantization
  - text-generation
  - llama
base_model: meta-llama/Llama-3.2-3B-Instruct
---

# Llama-3.2-3B-Instruct TensorRT-LLM checkpoint (NVFP4 weights + FP8 KV cache)

TensorRT-LLM **checkpoint** for **Llama-3.2-3B-Instruct**, with **NVFP4 (W4A4)** weight quantization and **FP8** KV cache. Use with `trtllm-build` to produce an engine for inference.

## Model details

| Item | Value |
|------|--------|
| **Base model** | Llama-3.2-3B-Instruct |
| **Framework** | TensorRT-LLM (checkpoint format) |
| **Weight quantization** | NVFP4 (W4A4) |
| **KV cache** | FP8 |
| **Producer** | TensorRT-Model-Optimizer llm_ptq + TensorRT-LLM convert_checkpoint (--use_nvfp4, --fp8_kv_cache) |
| **Architecture** | LlamaForCausalLM (decoder-only) |
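
To get a feel for what the FP8 KV cache costs in memory, the per-token size can be estimated with shell arithmetic. This is a rough sketch: the layer count, KV head count, and head dimension below are assumed values for Llama-3.2-3B and should be verified against the base model's configuration. Scaling-factor overhead is ignored.

```bash
# Assumed architecture values for Llama-3.2-3B (verify against the base model's config).
num_layers=28
num_kv_heads=8
head_dim=128
bytes_per_elem=1   # FP8 stores one byte per element

# Keys and values are both cached, hence the factor of 2.
kv_bytes_per_token=$((2 * num_layers * num_kv_heads * head_dim * bytes_per_elem))
echo "KV cache per token: ${kv_bytes_per_token} bytes"

# One sequence of 1024 tokens (the --max_seq_len used in the build example).
kv_mib=$((kv_bytes_per_token * 1024 / 1024 / 1024))
echo "KV cache at 1024 tokens: ${kv_mib} MiB"
```

Setting `bytes_per_elem=2` gives the corresponding FP16 figure, which is twice as large.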

## Build (how to produce this checkpoint)

NVFP4 requires a two-step pipeline: (1) run Model Optimizer llm_ptq to quantize the Hugging Face model to NVFP4; (2) run TensorRT-LLM convert_checkpoint with the PTQ output to produce this checkpoint.

### 1. Environment and dependencies

```bash
sudo apt install git-lfs
git lfs install

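pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com
# convert_checkpoint.py (used in step 2) lives in the TensorRT-LLM source tree
git clone https://github.com/NVIDIA/TensorRT-LLM
# Install TensorRT-Model-Optimizer (required for NVFP4 quantization)
# See https://github.com/NVIDIA/TensorRT-Model-Optimizer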
```

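### 2. Quantize base model to NVFP4 (llm_ptq) and convert

Clone the base model and run Model Optimizer's llm_ptq example to produce an NVFP4-quantized directory in Hugging Face format (see the Model Optimizer repository for the exact invocation). Then pass that directory to TensorRT-LLM's convert_checkpoint: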

```bash
# Example: after llm_ptq has produced PTQ output (NVFP4 weights),
# run convert_checkpoint with that directory as --model_dir:
python TensorRT-LLM/examples/llama/convert_checkpoint.py \
  --model_dir ./path/to/ptq_output \
  --output_dir ./llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8 \
  --dtype float16 \
  --use_nvfp4 \
  --fp8_kv_cache
```

### 3. Output

After conversion, the `--output_dir` directory contains `config.json` and `rank0.safetensors`. These two files are the checkpoint in this repo.
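
A quick way to confirm the conversion produced both files before uploading is a small check like the one below. `check_ckpt` is a hypothetical helper written for this README, not part of TensorRT-LLM.

```bash
# Hypothetical helper: verify that a checkpoint directory has the two expected files.
check_ckpt() {
  dir="$1"
  for f in config.json rank0.safetensors; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
  echo "ok: $dir"
}

# Usage (directory name taken from the convert_checkpoint command above):
# check_ckpt ./llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8
```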

## Upload (how to upload to Hugging Face)

```bash
cd ./llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8

huggingface-cli repo create rungalileo/llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8 --repo-type model
huggingface-cli upload rungalileo/llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8 . --repo-type model
```

## How to use

### 1. Build engine

Requires the `tensorrt_llm` package from [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM):

```bash
git clone https://huggingface.co/rungalileo/llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8
cd llama-3.2-3b-instruct-trtllm-ckpt-wq_nvfp4-kv_fp8

trtllm-build --checkpoint_dir . --output_dir ./engine \
  --max_batch_size 1 --max_input_len 512 --max_seq_len 1024
```
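
Because `--max_seq_len` bounds the prompt and the generated tokens together, the build limits above imply a generation budget. The arithmetic below is a small sketch using the same values; adjust them if you change the build flags.

```bash
# Values copied from the trtllm-build command above.
max_input_len=512
max_seq_len=1024

# max_seq_len covers prompt plus generated tokens, so what remains after a
# maximum-length prompt is the generation budget.
max_new_tokens=$((max_seq_len - max_input_len))
echo "generation budget at full prompt length: ${max_new_tokens} tokens"
```

Shorter prompts leave correspondingly more room for output, up to the same total of `max_seq_len` tokens.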

### 2. Run inference

This repo ships only the checkpoint files, so use the tokenizer from the base model (`meta-llama/Llama-3.2-3B-Instruct`):

```bash
trtllm-serve ./engine --tokenizer meta-llama/Llama-3.2-3B-Instruct --port 8000
# OpenAI-compatible API: http://localhost:8000/v1/completions
```
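
Once the server is up, the completions endpoint can be exercised with `curl`. The sketch below is illustrative: `completion_payload` and `query_completion` are hypothetical helpers, and the `model` value (`engine` by default here) is a placeholder assumption. If your server version exposes `/v1/models`, it reports the name to use.

```bash
# Build a JSON body for /v1/completions.
# $1 = prompt (no unescaped double quotes or backslashes), $2 = max_tokens.
# The "model" value is a placeholder (assumption); override it with MODEL_NAME.
completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' \
    "${MODEL_NAME:-engine}" "$1" "$2"
}

# Send the request to the server started above (port 8000).
query_completion() {
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "$(completion_payload "$1" "$2")"
}

# Usage, once trtllm-serve is running:
# query_completion "The capital of France is" 16
```

Keep `max_tokens` within the limits the engine was built with.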

## Files in this repo

- `config.json` – TensorRT-LLM model config
- `rank0.safetensors` – Rank 0 weights (single-GPU)
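
To check which quantization settings a downloaded `config.json` declares, a `grep` over the file is usually enough. This sketch assumes the config stores them under keys named `quant_algo` and `kv_cache_quant_algo`; those key names are an assumption here, so fall back to reading the file directly if nothing matches. `show_quant` is a hypothetical helper.

```bash
# Hypothetical helper: print the quantization entries found in a config file.
# The key names quant_algo and kv_cache_quant_algo are assumed, not guaranteed.
show_quant() {
  grep -Eo '"(quant_algo|kv_cache_quant_algo)"[[:space:]]*:[[:space:]]*"[^"]*"' "$1"
}

# Usage, from inside the cloned repo:
# show_quant config.json
```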

## References

- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
- [Llama 3.2](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
- [TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)