---
license: apache-2.0
language:
- en
- zh
library_name: transformers
tags:
- nvidia
- qwen3
- qwen3-vl
- nvfp4
- quantized
- blackwell
- sm121
- elk-ai
- vllm
- cuda13
- fp4
- vision-language
- thinking
- reasoning
- multimodal
base_model: Qwen/Qwen3-VL-4B-Thinking
pipeline_tag: image-text-to-text
model-index:
- name: qwen3-vl-4b-thinking-nvfp4-w4a16
  results: []
---

<div align="center">

# Qwen3-VL-4B-Thinking NVFP4 W4A16

### First NVFP4 Quantization of Qwen3-VL-4B-Thinking

**By Mutaz Al Awamleh | [ELK-AI](https://elkai.ai)**

[![Docker](https://img.shields.io/badge/Docker-elkaioptimization-blue?logo=docker)](https://hub.docker.com/r/elkaioptimization/vllm-nvfp4-cuda-13)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-cybermotaz-yellow?logo=huggingface)](https://huggingface.co/cybermotaz)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)

</div>

---

## Model Description

This is the **first publicly available NVFP4 W4A16 quantized** version of **Qwen3-VL-4B-Thinking**, a vision-language model optimized for NVIDIA Blackwell (SM121) architecture.

| Attribute | Original | NVFP4 Quantized |
|-----------|----------|-----------------|
| **Parameters** | 4B | Same |
| **Architecture** | Vision-Language + Thinking | Same |
| **Model Size** | ~8.3 GB | **~3.5 GB** |
| **Memory Savings** | - | **58%** |
| **Precision** | BF16 | FP4 W4A16 |

---

## Quick Start

### Using vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

model = LLM(
    model="cybermotaz/qwen3-vl-4b-thinking-nvfp4-w4a16",
    trust_remote_code=True,
    quantization="modelopt_fp4",
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.95
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
prompt = "Think step by step: What is shown in this image?"

outputs = model.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

### Using Docker (Pre-loaded)

```bash
# Pull the optimized container
docker pull elkaioptimization/vllm-nvfp4-cuda-13:qwen3-vl-4b-thinking-nvfp4-1.0

# Run with OpenAI-compatible API
docker run --gpus all -p 8000:8000 \
    elkaioptimization/vllm-nvfp4-cuda-13:qwen3-vl-4b-thinking-nvfp4-1.0
```

---

## Quantization Details

| Parameter | Value |
|-----------|-------|
| **Quantization Format** | NVFP4 (FP4 E2M1) |
| **Weight Precision** | 4-bit (W4) |
| **Activation Precision** | 16-bit (A16) |
| **Block Size** | 16 elements |
| **Scale Format** | FP8 E4M3 |
| **Calibration Dataset** | CNN/DailyMail (512 samples) |
| **Calibration Method** | AWQ-style |
| **Tool Used** | NVIDIA TensorRT-Model-Optimizer |

---

## Hardware Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| **GPU** | RTX 3070 (8GB) | RTX 4090 / DGX Spark |
| **GPU Memory** | 8 GB | 24 GB+ |
| **CUDA** | 12.4+ | 13.0 |
| **Driver** | 560+ | 570+ |

---

## Model Architecture

Qwen3-VL-4B-Thinking features:

- **Vision-Language**: Processes both images and text inputs
- **Enhanced Reasoning**: Optimized for step-by-step thinking and complex reasoning
- **Extended Context**: 32K native, 262K extended context length
- **Multilingual**: Strong performance in English and Chinese

---

## Links

| Resource | Link |
|----------|------|
| **Original Model** | [Qwen/Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking) |
| **Docker (Org)** | [elkaioptimization/vllm-nvfp4-cuda-13](https://hub.docker.com/r/elkaioptimization/vllm-nvfp4-cuda-13) |
| **Docker (Personal)** | [mutazai/vllm-spark-blackwell-nvfp4-optimized](https://hub.docker.com/r/mutazai/vllm-spark-blackwell-nvfp4-optimized) |
| **Author** | [Mutaz Al Awamleh](https://www.linkedin.com/in/mutazalawamleh/) |
| **Organization** | [ELK-AI](https://elkai.ai) |

---

## License

This model is released under the Apache 2.0 License, same as the original Qwen3 model.

---

<div align="center">

**Built by Mutaz Al Awamleh | ELK-AI**

*First to quantize Qwen3-VL-4B-Thinking to NVFP4 for Blackwell*

</div>