---
license: apache-2.0
tags:
  - tensorrt-llm
  - int4
  - awq
  - kv-cache-quantization
  - text-generation
  - mistral
base_model: mistralai/Mistral-7B-Instruct-v0.3
---

# Mistral-7B-Instruct-v0.3 TensorRT-LLM checkpoint (INT4 AWQ + INT8 KV)

TensorRT-LLM **checkpoint** for **Mistral-7B-Instruct-v0.3**, with **INT4 AWQ** weight quantization and an **INT8** KV cache. Pass it to `trtllm-build` to produce an engine for inference.

## Model details

| Item | Value |
|------|--------|
| **Base model** | Mistral-7B-Instruct-v0.3 |
| **Framework** | TensorRT-LLM (checkpoint format) |
| **Weight quantization** | INT4 AWQ |
| **KV cache** | INT8 |
| **Producer** | TensorRT-LLM v0.18.0 `convert_checkpoint.py` (modelopt 0.25.0) |
| **Architecture** | MistralForCausalLM (decoder-only) |

## Build (how to produce this checkpoint)

### 1. Environment and dependencies

```bash
# Refresh the package index first, then install everything in one step
sudo apt-get update
sudo apt-get -y install git-lfs python3.12 python3-pip
git lfs install

pip3 install tensorrt_llm==0.18.0 --extra-index-url https://pypi.nvidia.com
pip3 install datasets==3.6.0
pip3 install "onnx>=1.12,<1.20"
```

### 2. Clone repos and base model

```bash
git clone -b v0.18.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone https://huggingface.co/unsloth/mistral-7b-instruct-v0.3
# Or: git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
```

### 3. Convert checkpoint (INT4 AWQ + INT8 KV)

```bash
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
  --model_dir ./mistral-7b-instruct-v0.3 \
  --output_dir ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8 \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4_awq \
  --int8_kv_cache
```

Optionally, set the calibration data with `--calib_dataset <path_or_name>`, for example a local parquet directory or `pileval`.

### 4. Output

After conversion, the directory given by `--output_dir` contains `config.json` and `rank0.safetensors`. These two files make up the checkpoint in this repo.
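A quick sanity check of the output directory can catch an incomplete conversion before uploading. The sketch below is only an illustration and not part of the TensorRT-LLM tooling: `check_ckpt` is a hypothetical helper name, and it only tests that the two expected files exist and are non-empty.

```bash
# Hypothetical helper: verify that a checkpoint directory holds the two
# files this README expects. It does not validate their contents.
check_ckpt() {
  local dir="$1" f
  for f in config.json rank0.safetensors; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      return 1
    fi
  done
  echo "checkpoint looks complete: $dir"
}
```

For example, `check_ckpt ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8` returns a non-zero status if either file is absent.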

## Upload (to Hugging Face)

```bash
cd ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8

# Create the repo first if it does not exist
huggingface-cli repo create rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8 --repo-type model

# Upload everything in the current directory to the repo
huggingface-cli upload rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8 . --repo-type model
```

## How to use

### 1. Build engine

Requires the `tensorrt_llm` Python package from [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) (e.g. v0.18.0), which provides the `trtllm-build` command:

```bash
# Clone this repo or download from HF
git clone https://huggingface.co/rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8
cd mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8

# Build TensorRT-LLM engine (adjust max_batch_size / max_seq_len as needed)
trtllm-build --checkpoint_dir . --output_dir ./engine \
  --max_batch_size 1 --max_input_len 512 --max_seq_len 1024
```
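When choosing these limits, it helps to keep in mind how the two length flags relate. Assuming `--max_seq_len` bounds the prompt plus the generated tokens of a single request, the room left for generation is `max_seq_len` minus the prompt length. The sketch below makes that arithmetic explicit; `max_new_tokens` is a hypothetical helper, not a TensorRT-LLM command.

```bash
# Hypothetical helper: tokens left for generation, given the engine's
# max_seq_len and the length of the prompt (both counted in tokens).
max_new_tokens() {
  local max_seq_len="$1" prompt_len="$2"
  if [ "$prompt_len" -gt "$max_seq_len" ]; then
    echo "prompt ($prompt_len) exceeds max_seq_len ($max_seq_len)" >&2
    return 1
  fi
  echo $(( max_seq_len - prompt_len ))
}

# With max_seq_len 1024, a full 512-token prompt leaves 512 tokens.
max_new_tokens 1024 512
```

If longer completions are needed, raise `--max_seq_len` at build time rather than at request time, since the engine is built for fixed limits.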

### 2. Run inference

Example with `trtllm-serve`, which needs the tokenizer from the base model (e.g. `mistralai/Mistral-7B-Instruct-v0.3`):

```bash
trtllm-serve ./engine --tokenizer mistralai/Mistral-7B-Instruct-v0.3 --port 8000
# Then call the OpenAI-compatible API at http://localhost:8000/v1/completions
```
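A request against that endpoint could look like the following sketch. It assumes the server is listening on `localhost:8000`; the helper names `completion_payload` and `complete` are hypothetical, and the `"model"` value is a placeholder that may need to match the model name the server expects.

```bash
# Assumption: the server started above is reachable at this address.
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Build a minimal JSON body. The prompt is inserted verbatim, so it must
# not contain double quotes or backslashes.
completion_payload() {
  local prompt="$1" max_tokens="${2:-64}"
  printf '{"model": "mistral-7b-instruct-v0.3", "prompt": "%s", "max_tokens": %s}\n' \
    "$prompt" "$max_tokens"
}

# POST the body to the completions endpoint and print the raw response.
complete() {
  curl -s "$BASE_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(completion_payload "$@")"
}
```

For example, `complete "The capital of France is" 16` sends a 16-token completion request. Keep `max_tokens` within the limits the engine was built with.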

## Files in this repo

- `config.json` – TensorRT-LLM model config
- `rank0.safetensors` – Rank 0 weights (single-GPU)

## References

- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
- [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)