Instructions to use unsloth/Mistral-Small-4-119B-2603-GGUF with libraries, inference providers, notebooks, and local apps.

Libraries

How to use unsloth/Mistral-Small-4-119B-2603-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="unsloth/Mistral-Small-4-119B-2603-GGUF",
	filename="BF16/Mistral-Small-4-119B-2603-BF16-00001-of-00006.gguf",
)

llm.create_chat_completion(
	messages = [
		{"role": "user", "content": "What is the capital of France?"}
	]
)

Local Apps

llama.cpp

How to use unsloth/Mistral-Small-4-119B-2603-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M
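
Whichever install route above is used, `llama-server` exposes an OpenAI-compatible HTTP API. The following is a minimal sketch of a client for it, using only the Python standard library. It assumes the server is listening on its default address (`http://localhost:8080/v1`, the same base URL used in the Pi configuration further down); the helper names `build_chat_request` and `chat` are illustrative and not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server runs on its default host/port; change this if you passed --host/--port.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply. Nothing is sent until `chat` is called.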

Use Docker

docker model run hf.co/unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M

LM Studio
Jan
Ollama
How to use unsloth/Mistral-Small-4-119B-2603-GGUF with Ollama:
```
ollama run hf.co/unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M
```

Unsloth Studio

How to use unsloth/Mistral-Small-4-119B-2603-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/Mistral-Small-4-119B-2603-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/Mistral-Small-4-119B-2603-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/Mistral-Small-4-119B-2603-GGUF to start chatting

Pi

How to use unsloth/Mistral-Small-4-119B-2603-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M"
        }
      ]
    }
  }
}
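
If you script your setup, the same configuration can be generated rather than typed by hand. The sketch below reproduces only the keys shown in the snippet above; the function name `build_pi_models_config` is hypothetical, and no Pi options beyond those shown are assumed.

```python
import json


def build_pi_models_config(model_id: str, base_url: str = "http://localhost:8080/v1") -> dict:
    """Return a dict with the same shape as the models.json snippet above."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


config = build_pi_models_config("unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M")
# Print the JSON so it can be reviewed before saving it to ~/.pi/agent/models.json.
print(json.dumps(config, indent=2))
```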

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use unsloth/Mistral-Small-4-119B-2603-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M

Run Hermes

hermes

Atomic Chat
Docker Model Runner
How to use unsloth/Mistral-Small-4-119B-2603-GGUF with Docker Model Runner:
```
docker model run hf.co/unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M
```

Lemonade

How to use unsloth/Mistral-Small-4-119B-2603-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull unsloth/Mistral-Small-4-119B-2603-GGUF:UD-Q4_K_M

Run and chat with the model

lemonade run user.Mistral-Small-4-119B-2603-GGUF-UD-Q4_K_M

List all available models

lemonade list

Includes Unsloth chat template fixes!
For llama.cpp, use --jinja

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Mistral Small 4 119B A6B

Mistral Small 4 is a powerful hybrid model capable of acting as both a general instruction model and a reasoning model. It unifies three model families in a single model: Instruct, Reasoning (previously called Magistral), and Devstral.

With its multimodal capabilities, efficient architecture, and flexible mode switching, it is a powerful general-purpose model for any task. In a latency-optimized setup, Mistral Small 4 achieves a 40% reduction in end-to-end completion time, and in a throughput-optimized setup, it handles 3x more requests per second compared to Mistral Small 3.

To further improve efficiency, you can take advantage of either of the following:

Speculative decoding, thanks to our trained EAGLE head mistralai/Mistral-Small-4-119B-2603-eagle.
4-bit float precision quantization, thanks to our NVFP4 checkpoint mistralai/Mistral-Small-4-119B-2603-NVFP4.

Key Features

Mistral Small 4 includes the following architectural choices:

MoE: 128 experts, 4 active.
119B parameters, with 6.5B activated per token.
256k context length.
Multimodal input: Accepts both text and image input, with text output.
Instruct and Reasoning functionalities with function calls (reasoning effort configurable per request).
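
To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It is a sketch under simplifying assumptions: every weight is stored at one uniform bit-width, and the KV cache, activations, and per-block quantization metadata are ignored, so real GGUF files and runtime memory use will differ.

```python
TOTAL_PARAMS = 119e9    # total parameters, from the list above
ACTIVE_PARAMS = 6.5e9   # parameters activated per token, from the list above
NUM_EXPERTS = 128
ACTIVE_EXPERTS = 4


def weight_gigabytes(bits_per_weight: float, params: float = TOTAL_PARAMS) -> float:
    """Rough weight-only size in GB (10**9 bytes) at a uniform bit-width."""
    return params * bits_per_weight / 8 / 1e9


active_expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS
active_param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Experts used per token: {active_expert_fraction:.1%}")
print(f"Parameters used per token: {active_param_fraction:.1%}")
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_gigabytes(bits):.1f} GB")
```

Under these assumptions, the weights alone come to roughly 238 GB at 16 bits and 59.5 GB at 4 bits. The share of parameters used per token (about 5.5%) is higher than the share of experts (about 3.1%), which is consistent with non-expert weights such as attention layers being used for every token.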

Mistral Small 4 offers the following capabilities:

Reasoning Mode: Toggle between fast instant reply mode and reasoning mode, boosting performance with test-time compute when requested.
Vision: Analyzes images and provides insights based on visual content, in addition to text.
Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
System Prompt: Strong adherence and support for system prompts.
Agentic: Best-in-class agentic capabilities with native function calling and JSON output.
Speed-Optimized: Delivers best-in-class performance and speed.
Apache 2.0 License: Open-source license for both commercial and non-commercial use.
Large Context Window: Supports a 256k context window.

Use Cases

Mistral Small 4 is designed for general chat assistants, coding, agentic tasks, and reasoning tasks (with reasoning mode toggled). Its multimodal capabilities also enable document and image understanding for data extraction and analysis.

Its capabilities are ideal for:

Developers interested in coding and agentic capabilities for SWE automation and codebase exploration.
Enterprises seeking general chat assistants, agents, and document understanding.
Researchers leveraging its math and research capabilities.

Mistral Small 4 is also well-suited for customization and fine-tuning for more specialized tasks.

Examples

General chat assistant
Document parsing and extraction
Coding agent
Research assistant
Customization & fine-tuning
And more...

Benchmarks

Comparison with internal models

Depending on the task, you can enable reasoning through the per-request parameter reasoning_effort. Set it to:

reasoning_effort="none": Fast, lightweight responses for everyday tasks, in the same chat style as mistralai/Mistral-Small-3.2-24B-Instruct-2506.
reasoning_effort="high": Deep, step-by-step reasoning for complex problems, with verbosity equivalent to previous Magistral models such as mistralai/Magistral-Small-2509.
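
Because the parameter is set per request, one client can mix both modes. The sketch below builds the keyword arguments for a chat completion call and picks the effort level from a flag; the helper name `build_chat_kwargs` and the `needs_reasoning` flag are illustrative, and only the two values listed above are used.

```python
def build_chat_kwargs(model: str, prompt: str, needs_reasoning: bool) -> dict:
    """Keyword arguments for an OpenAI-style chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Only the two documented values are used: "high" or "none".
        "reasoning_effort": "high" if needs_reasoning else "none",
    }


quick = build_chat_kwargs("mistralai/Mistral-Small-4-119B-2603", "Say hello.", needs_reasoning=False)
deep = build_chat_kwargs("mistralai/Mistral-Small-4-119B-2603", "Prove that 17 is prime.", needs_reasoning=True)
print(quick["reasoning_effort"], deep["reasoning_effort"])
```

With an OpenAI-compatible client, the result can be passed as `client.chat.completions.create(**quick)`.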

Comparing Reasoning Models

Comparison with other models

Mistral Small 4 with reasoning achieves competitive scores, matching or surpassing GPT-OSS 120B across all three benchmarks while generating significantly shorter outputs. On AA LCR, Mistral Small 4 scores 0.72 with just 1.6K characters, whereas Qwen models require 3.5-4x more output (5.8-6.1K characters) for comparable performance. On LiveCodeBench, Mistral Small 4 outperforms GPT-OSS 120B while producing 20% less output. This efficiency reduces latency and inference costs and improves the user experience.

Usage

Mistral Small 4 is supported by multiple libraries for inference and fine-tuning. We thank all the contributors and maintainers who helped make this happen.

Inference

The model can be deployed with:

vllm (recommended): See here.
llama.cpp: See here. (WIP ⏳ – follow updates here)
SGLang: (WIP ⏳ – follow updates here)
transformers: See here

For optimal performance, we recommend using the Mistral AI API if local serving is subpar.

Fine-Tuning

Fine-tune the model via:

Axolotl: See here.

vLLM (Recommended)

We recommend using Mistral Small 4 with the vLLM library for production-ready inference.

Installation

Use our custom Docker image with fixes for tool calling and reasoning parsing in vLLM, and the latest Transformers version. We are working with the vLLM team to merge these fixes soon.

Custom Docker: Use the Docker image mistralllm/vllm-ms4:latest:

docker pull mistralllm/vllm-ms4:latest
docker run -it mistralllm/vllm-ms4:latest

Manual Install: Alternatively, install vllm from this PR: Add Mistral Guidance.

Note: This PR is expected to be merged into vllm main in the next 1-2 weeks (as of 16.03.2026). Track updates here.

Clone vLLM:

git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git
cd vllm

Install with pre-compiled kernels:

VLLM_USE_PRECOMPILED=1 pip install --editable .

Install transformers from main:

uv pip install git+https://github.com/huggingface/transformers.git

Ensure mistral_common >= 1.10.0 is installed:

python -c "import mistral_common; print(mistral_common.__version__)"
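
The one-liner above prints the version but leaves the comparison to you. Comparing version strings directly is error-prone: as plain strings, "1.9.0" sorts after "1.10.0". The following sketch does the comparison numerically using only the standard library. The helper names are illustrative, and the parsing deliberately ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_MISTRAL_COMMON = (1, 10, 0)


def parse_release(version_string: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0); trailing non-digit suffixes are ignored."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version_string: str, minimum: tuple = MIN_MISTRAL_COMMON) -> bool:
    """True if the version is at least the required minimum."""
    return parse_release(version_string) >= minimum


try:
    installed = version("mistral_common")
    status = "OK" if meets_minimum(installed) else "too old"
    print(f"mistral_common {installed}: {status}")
except PackageNotFoundError:
    print("mistral_common is not installed")
```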

Serve the Model

We recommend a server/client setup:

vllm serve mistralai/Mistral-Small-4-119B-2603 --max-model-len 262144 --tensor-parallel-size 2 --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max_num_batched_tokens 16384 --max_num_seqs 128 \
  --gpu_memory_utilization 0.8

Ping the Server
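
Before running the examples below, it can help to confirm that the server is reachable. The following sketch lists the served models through the OpenAI-compatible `/v1/models` endpoint using only the standard library. It assumes the base URL `http://localhost:8000/v1` used in the examples that follow; the helper names are illustrative.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

# Assumption: same base URL as the client examples in this section.
BASE_URL = "http://localhost:8000/v1"


def models_url(base_url: str = BASE_URL) -> str:
    """URL of the endpoint that lists served models."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str = BASE_URL, timeout: float = 5.0) -> Optional[list]:
    """Return the served model ids, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError):
        return None
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling `list_served_models()` returns a list of model ids once the server has finished loading, and `None` while it is still unreachable.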

Instruction Following

Mistral Small 4 can follow your instructions to the letter.

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="none",
)

assistant_message = response.choices[0].message.content
print(assistant_message)

Tool Call

Let's solve some equations using our simple Python calculator tool.

import json
from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


def my_calculator(expression: str) -> str:
    # Caution: eval() runs arbitrary Python, so only use this demo tool with trusted input.
    return str(eval(expression))


tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    tools=tools,
    tool_choice="auto",
    reasoning_effort="none",
)

tool_calls = response.choices[0].message.tool_calls

results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
        results.append(result)

messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )


response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="none",
)

print(response.choices[0].message.content)
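
The `my_calculator` tool above passes model-generated text to `eval`, which is convenient for a demo but executes arbitrary Python. As a hedged alternative, the sketch below evaluates only basic arithmetic by walking the parsed syntax tree. It is not part of the original example, the name `safe_calculator` is illustrative, and exponentiation is left out on purpose because it can produce very large computations.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST):
    """Recursively evaluate numbers and whitelisted arithmetic operators."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError(f"Unsupported expression element: {type(node).__name__}")


def safe_calculator(expression: str) -> str:
    """Evaluate +, -, *, / on numeric literals and return the result as a string."""
    tree = ast.parse(expression, mode="eval")
    return str(_evaluate(tree.body))


print(safe_calculator("2 + 3 * 4"))
```

It has the same signature as `my_calculator`, so it can be swapped into the tool-dispatch loop; function calls, names, and attribute access raise `ValueError` instead of being executed.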

Vision Reasoning

Let's see if Mistral Small 4 knows when to pick a fight!

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]


response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="high",
)

print(response.choices[0].message.content)

Transformers

Installation

You need to install the main branch of Transformers to use Mistral Small 4:

uv pip install git+https://github.com/huggingface/transformers.git

Inference

Note: The current Transformers implementation does not support FP8. The weights are stored in FP8, and support for loading them in that format is expected. In the meantime, we provide a snippet that dequantizes the weights to BF16 to ease usage. We will update the following code snippet as soon as support is added.

Python Inference Snippet

from pathlib import Path

import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file
from tqdm import tqdm

from transformers import AutoConfig, AutoProcessor, Mistral3ForConditionalGeneration


def _descale_fp8_to_bf16(tensor: torch.Tensor, scale_inv: torch.Tensor) -> torch.Tensor:
    return (tensor.to(torch.bfloat16) * scale_inv.to(torch.bfloat16)).to(torch.bfloat16)


def _resolve_model_dir(model_id: str) -> Path:
    local = Path(model_id)
    if local.is_dir():
        return local
    return Path(snapshot_download(model_id, allow_patterns=["model*.safetensors"]))


def load_and_dequantize_state_dict(model_id: str) -> dict[str, torch.Tensor]:
    model_dir = _resolve_model_dir(model_id)

    shards = sorted(model_dir.glob("model*.safetensors"))

    full_state_dict: dict[str, torch.Tensor] = {}
    for shard in tqdm(shards, desc="Loading safetensors shards"):
        full_state_dict.update(load_file(str(shard)))

    scale_suffixes = ("weight_scale_inv", "gate_up_proj_scale_inv", "down_proj_scale_inv", "up_proj_scale_inv")
    activation_scale_suffixes = ("activation_scale", "gate_up_proj_activation_scale", "down_proj_activation_scale")

    keys_to_remove: set[str] = set()
    all_keys = list(full_state_dict.keys())

    for key in tqdm(all_keys, desc="Dequantizing FP8 weights to BF16"):
        if any(key.endswith(s) for s in scale_suffixes + activation_scale_suffixes):
            continue

        for scale_suffix in scale_suffixes:
            if scale_suffix == "weight_scale_inv":
                if not key.endswith(".weight"):
                    continue
                scale_key = key.rsplit(".weight", 1)[0] + ".weight_scale_inv"
            else:
                proj_name = scale_suffix.replace("_scale_inv", "")
                if not key.endswith(f".{proj_name}"):
                    continue
                scale_key = key + "_scale_inv"

            if scale_key in full_state_dict:
                full_state_dict[key] = _descale_fp8_to_bf16(full_state_dict[key], full_state_dict[scale_key])
                keys_to_remove.add(scale_key)

    for key in full_state_dict:
        if any(key.endswith(s) for s in activation_scale_suffixes):
            keys_to_remove.add(key)

    for key in tqdm(keys_to_remove, desc="Removing scale keys"):
        del full_state_dict[key]

    return full_state_dict


def load_config_without_quantization(model_id: str) -> AutoConfig:
    config = AutoConfig.from_pretrained(model_id)

    if hasattr(config, "quantization_config"):
        del config.quantization_config

    if hasattr(config, "text_config") and hasattr(config.text_config, "quantization_config"):
        del config.text_config.quantization_config

    return config


model_id = "mistralai/Mistral-Small-4-119B-2603"

config = load_config_without_quantization(model_id)
state_dict = load_and_dequantize_state_dict(model_id)

model = Mistral3ForConditionalGeneration.from_pretrained(
    None,
    config=config,
    state_dict=state_dict,
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_id)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, return_tensors="pt", tokenize=True, return_dict=True, reasoning_effort="high"
)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1024,
)[0]

# Setting `skip_special_tokens=False` to visualize reasoning trace between [THINK] [/THINK] tags.
decoded_output = processor.decode(output[len(inputs["input_ids"][0]) :], skip_special_tokens=False)
print(decoded_output)
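
The least obvious part of the snippet above is how each FP8 tensor is paired with its scale tensor. The sketch below isolates that name-matching rule in pure Python, with no torch required, so it can be checked on its own. It mirrors the suffix logic of `load_and_dequantize_state_dict`; the function name and the tensor names in the demo loop are hypothetical and are not taken from the actual checkpoint.

```python
SCALE_SUFFIXES = (
    "weight_scale_inv",
    "gate_up_proj_scale_inv",
    "down_proj_scale_inv",
    "up_proj_scale_inv",
)


def candidate_scale_keys(key: str) -> list:
    """Return the scale-tensor names that the loader would look up for `key`."""
    candidates = []
    for suffix in SCALE_SUFFIXES:
        if suffix == "weight_scale_inv":
            # "<prefix>.weight" pairs with "<prefix>.weight_scale_inv".
            if key.endswith(".weight"):
                candidates.append(key.rsplit(".weight", 1)[0] + ".weight_scale_inv")
        else:
            # "<prefix>.<proj>" pairs with "<prefix>.<proj>_scale_inv".
            proj_name = suffix.replace("_scale_inv", "")
            if key.endswith(f".{proj_name}"):
                candidates.append(key + "_scale_inv")
    return candidates


# Hypothetical tensor names, for illustration only.
for name in (
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.experts.gate_up_proj",
    "model.layers.0.input_layernorm.bias",
):
    print(name, "->", candidate_scale_keys(name))
```

A tensor is dequantized only when one of its candidate scale keys is actually present in the state dict; tensors with no match, such as the last name in the loop, are left untouched.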

License

This model is licensed under the Apache 2.0 License.

You must not use this model in a manner that infringes, misappropriates, or violates any third party’s rights, including intellectual property rights.

GGUF

Model size: 119B params
Architecture: mistral4

Model tree for unsloth/Mistral-Small-4-119B-2603-GGUF

Base model: mistralai/Mistral-Small-4-119B-2603
Quantized (31): this model
Quantizations: 1 model