---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.1
base_model: meta-llama/Llama-4-Maverick-17B-128E-Instruct
---
# Llama-4-Maverick-17B-128E-Instruct-NVFP4
## Model Overview
- **Model Architecture:** Llama4ForConditionalGeneration
- **Input:** Text / Image
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Intended Use Cases:** Intended for commercial and research use in multiple languages.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than those listed as supported for this model.
- **Release Date:** 10/29/25
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** RedHatAI
This model is a quantized version of [Llama-4-Maverick-17B-128E-Instruct](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
### Model Optimizations
This model was obtained by quantizing the weights and activations of [Llama-4-Maverick-17B-128E-Instruct](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct) to the FP4 data type, ready for inference with vLLM>=0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
Only the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor). The quantization recipe leaves the attention, router, vision, and `lm_head` modules unquantized.
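The size reduction quoted above follows from simple arithmetic, sketched here. The second figure rests on an assumption not stated in this card: that the NVFP4 format stores one 8-bit scale per block of 16 values. It also ignores the modules that stay unquantized, so the real saving for the full checkpoint will be somewhat lower.

```python
def size_reduction(bits_before: float, bits_after: float) -> float:
    """Fraction of storage saved when going from bits_before to bits_after per parameter."""
    return 1.0 - bits_after / bits_before


# Ideal case from the text: 16-bit parameters stored in 4 bits.
ideal = size_reduction(16, 4)

# Assumption: one 8-bit scale shared by each block of 16 values,
# which adds 8 / 16 = 0.5 bits per parameter.
with_scales = size_reduction(16, 4 + 8 / 16)

print(f"ideal: {ideal:.1%}, with block scales: {with_scales:.1%}")
```

Under these assumptions the saving for the quantized layers is a little under 75%, which is consistent with the "approximately 75%" stated above.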
## Deployment
### Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
Model Usage Code
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across GPUs with tensor parallelism.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
Model Creation Code
```python
import torch
from datasets import load_dataset
from transformers import Llama4ForConditionalGeneration, Llama4Processor

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = Llama4Processor.from_pretrained(model_id)

# MoE calibration is now handled automatically by the pipeline.
# The `SequentialLlama4TextMoe` modules (from `llmcompressor.modeling.llama4`)
# will be applied during calibration to enable
# proper expert calibration and vLLM compatibility.
# These replace the original `Llama4TextMoe` class from
# `transformers.models.llama4.modeling_llama4`.

DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 8192

ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")


def preprocess_function(example):
    # Convert each message into the structured content format the processor expects.
    messages = []
    for message in example["messages"]:
        messages.append(
            {
                "role": message["role"],
                "content": [{"type": "text", "text": message["content"]}],
            }
        )

    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    # Calibration runs one sample at a time.
    assert len(batch) == 1
    return {
        key: (
            torch.tensor(value)
            if key != "pixel_values"
            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
        )
        for key, value in batch[0].items()
    }


# Configure the quantization algorithm to run.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:.*self_attn",
        "re:.*router",
        "re:.*vision_model.*",
        "re:.*multi_modal_projector.*",
        "Llama4TextAttention",
    ],
)

# Apply algorithms.
# Due to the large size of Llama4, we specify sequential targets such that
# only one MLP is loaded into GPU memory at a time.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    sequential_targets=["Llama4TextMLP"],
    data_collator=data_collator,
)

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```
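To make the effect of the `ignore` list in the recipe easier to see, here is a minimal sketch of how such entries could be matched against modules. It rests on two assumptions: an entry prefixed with `re:` is treated as a regular expression matched from the start of the module name, and a plain entry matches either the module name or its class name. The helper `is_ignored` and the module names are hypothetical and are not taken from LLM Compressor or from the model itself.

```python
import re

# Ignore list copied from the recipe in the model creation code.
IGNORE = [
    "re:.*lm_head",
    "re:.*self_attn",
    "re:.*router",
    "re:.*vision_model.*",
    "re:.*multi_modal_projector.*",
    "Llama4TextAttention",
]


def is_ignored(module_name: str, class_name: str) -> bool:
    """Return True if the module matches any ignore entry (assumed semantics)."""
    for entry in IGNORE:
        if entry.startswith("re:"):
            # Assumption: regex entries are matched from the start of the name.
            if re.match(entry[3:], module_name):
                return True
        elif entry in (module_name, class_name):
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    ("language_model.model.layers.0.self_attn.q_proj", "Linear"),
    ("language_model.model.layers.0.feed_forward.experts.0.gate_proj", "Linear"),
    ("language_model.model.layers.0.feed_forward.router", "Linear"),
    ("vision_model.model.layers.0.mlp.fc1", "Linear"),
    ("language_model.lm_head", "Linear"),
]

for name, cls in examples:
    status = "skipped" if is_ignored(name, cls) else "quantized"
    print(f"{name}: {status}")
```

With these assumed names, only the expert projection would be quantized; the attention, router, vision, and output-head modules would be left in their original precision.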
## Evaluation
| Category | Metric | Llama-4-Maverick-17B-128E-Instruct | Llama-4-Maverick-17B-128E-Instruct-NVFP4 (this model) | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | arc_challenge_llama | 95.97 | 95.88 | 99.91 |
| | gsm8k_llama | 96.13 | 96.06 | 99.93 |
| | mmlu_llama | 86.77 | 85.49 | 98.53 |
| | mmlu_cot_llama | 89.49 | 88.72 | 99.14 |
| | truthfulqa_mc2 | 68.23 | 68.42 | 100.28 |
| | winogrande | 77.98 | 77.74 | 99.69 |
| | hellaswag | | | |
| | Average | 85.23 | | |
| OpenLLM V2 | BBH | 69.52 | | |
| | MMLU-Pro | 62.83 | | |
| | MuSR | 45.77 | | |
| | IFEval | 89.45 | | |
| | GPQA | 30.54 | | |
| | Math-Hard | 64.95 | | |
| | Average | 60.51 | | |
| Coding | HumanEval_64 (pass@2) | 88.88 | | |
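
The Recovery column in the table above appears to be the quantized score expressed as a percentage of the baseline score. The sketch below recomputes it for the rows that report both scores, under that assumption. Because the inputs are the rounded scores from the table, a recomputed value can differ from the reported one in the last digit.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized / baseline


# (baseline, quantized, reported recovery) copied from the table.
rows = {
    "arc_challenge_llama": (95.97, 95.88, 99.91),
    "gsm8k_llama": (96.13, 96.06, 99.93),
    "mmlu_llama": (86.77, 85.49, 98.53),
    "mmlu_cot_llama": (89.49, 88.72, 99.14),
    "truthfulqa_mc2": (68.23, 68.42, 100.28),
    "winogrande": (77.98, 77.74, 99.69),
}

for metric, (baseline, quantized, reported) in rows.items():
    computed = recovery(baseline, quantized)
    print(f"{metric}: computed {computed:.2f}, reported {reported:.2f}")
```

A recovery above 100, as for `truthfulqa_mc2`, means the quantized model scored slightly higher than the baseline on that task.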