---
license: apache-2.0
tags:
- tensorrt-llm
- int4
- awq
- kv-cache-quantization
- text-generation
- mistral
base_model: mistralai/Mistral-7B-Instruct-v0.3
---

# Mistral-7B-Instruct-v0.3 TensorRT-LLM checkpoint (INT4 AWQ + INT8 KV)

TensorRT-LLM **checkpoint** for **Mistral-7B-Instruct-v0.3**, with **INT4 AWQ** weight quantization and **INT8** KV cache. Use with `trtllm-build` to produce an engine for inference.

## Model details

| Item | Value |
|------|--------|
| **Base model** | Mistral-7B-Instruct-v0.3 |
| **Framework** | TensorRT-LLM (checkpoint format) |
| **Weight quantization** | INT4 AWQ |
| **KV cache** | INT8 |
| **Producer** | TensorRT-LLM v0.18.0 `convert_checkpoint.py` (modelopt 0.25.0) |
| **Architecture** | MistralForCausalLM (decoder-only) |

## Build (how to produce this checkpoint)

### 1. Environment and dependencies

```bash
sudo apt install git-lfs
git lfs install
sudo apt-get update && sudo apt-get -y install python3.12 python3-pip
pip3 install tensorrt_llm==0.18.0 --extra-index-url https://pypi.nvidia.com
pip3 install datasets==3.6.0
pip3 install "onnx>=1.12,<1.20"
```

### 2. Clone repos and base model

```bash
git clone -b v0.18.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone https://huggingface.co/unsloth/mistral-7b-instruct-v0.3
# Or: git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
```

### 3. Convert checkpoint (INT4 AWQ + INT8 KV)

```bash
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
  --model_dir ./mistral-7b-instruct-v0.3 \
  --output_dir ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8 \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4_awq \
  --int8_kv_cache
```

(Optional: set calibration data with `--calib_dataset`, e.g. a local parquet dir or `pileval`.)

### 4. Output

After conversion, `--output_dir` will contain `config.json` and `rank0.safetensors`; that is the checkpoint in this repo.
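
Before uploading or building an engine, it can help to confirm that the conversion produced both files. The following is a minimal sketch, not part of TensorRT-LLM: `check_ckpt_dir` is a hypothetical helper name, and it only checks that the two files named above exist, not that their contents are valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a TensorRT-LLM tool): check that a converted
# checkpoint directory contains the two files the conversion step produces.
check_ckpt_dir() {
  local dir="$1"
  local missing=0
  local f
  for f in config.json rank0.safetensors; do
    if [ ! -f "$dir/$f" ]; then
      # Report each missing file on stderr and remember the failure
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Exit status 0 if both files exist, 1 otherwise
  return "$missing"
}
```

For example, `check_ckpt_dir ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8 && echo "checkpoint files present"` prints the message only when both files are found.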
## Upload (how to upload to Hugging Face)

```bash
cd ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8
# Create the repo first if it does not exist
huggingface-cli repo create rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8 --repo-type model
# Upload everything in the current directory to the repo
huggingface-cli upload rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8 . --repo-type model
```

## How to use

### 1. Build engine

Requires [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) (e.g. v0.18.0), with the `tensorrt_llm` package installed:

```bash
# Clone this repo or download from HF
git clone https://huggingface.co/rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8
cd mistral-7b-instruct-v0.3-trtllm-ckpt-wq_int4_awq-kv_int8

# Build TensorRT-LLM engine (adjust max_batch_size / max_seq_len as needed)
trtllm-build --checkpoint_dir . --output_dir ./engine \
  --max_batch_size 1 --max_input_len 512 --max_seq_len 1024
```

### 2. Run inference

Example with `trtllm-serve` (requires the tokenizer from the base model, e.g. `mistralai/Mistral-7B-Instruct-v0.3`):

```bash
trtllm-serve ./engine --tokenizer mistralai/Mistral-7B-Instruct-v0.3 --port 8000
# Then call the OpenAI-compatible API at http://localhost:8000/v1/completions
```

## Files in this repo

- `config.json` – TensorRT-LLM model config
- `rank0.safetensors` – Rank 0 weights (single-GPU)

## References

- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
- [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)