---
license: apache-2.0
language:
- en
- zh
library_name: transformers
tags:
- nvidia
- qwen3
- qwen3-vl
- nvfp4
- quantized
- blackwell
- sm121
- elk-ai
- vllm
- cuda13
- fp4
- vision-language
- thinking
- reasoning
- multimodal
base_model: Qwen/Qwen3-VL-4B-Thinking
pipeline_tag: image-text-to-text
model-index:
- name: qwen3-vl-4b-thinking-nvfp4-w4a16
results: []
---
# Qwen3-VL-4B-Thinking NVFP4 W4A16
### First NVFP4 Quantization of Qwen3-VL-4B-Thinking
**By Mutaz Al Awamleh | [ELK-AI](https://elkai.ai)**
[](https://hub.docker.com/r/elkaioptimization/vllm-nvfp4-cuda-13)
[](https://huggingface.co/cybermotaz)
[](LICENSE)
---
## Model Description
This is the **first publicly available NVFP4 W4A16 quantized** version of **Qwen3-VL-4B-Thinking**, a vision-language model optimized for NVIDIA Blackwell (SM121) architecture.
| Attribute | Original | NVFP4 Quantized |
|-----------|----------|-----------------|
| **Parameters** | 4B | Same |
| **Architecture** | Vision-Language + Thinking | Same |
| **Model Size** | ~8.3 GB | **~3.5 GB** |
| **Memory Savings** | - | **58%** |
| **Precision** | BF16 | FP4 W4A16 |
---
## Quick Start
### Using vLLM (Recommended)
```python
from vllm import LLM, SamplingParams
model = LLM(
model="cybermotaz/qwen3-vl-4b-thinking-nvfp4-w4a16",
trust_remote_code=True,
quantization="modelopt_fp4",
kv_cache_dtype="fp8",
gpu_memory_utilization=0.95
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
prompt = "Think step by step: What is shown in this image?"
outputs = model.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
### Using Docker (Pre-loaded)
```bash
# Pull the optimized container
docker pull elkaioptimization/vllm-nvfp4-cuda-13:qwen3-vl-4b-thinking-nvfp4-1.0
# Run with OpenAI-compatible API
docker run --gpus all -p 8000:8000 \
elkaioptimization/vllm-nvfp4-cuda-13:qwen3-vl-4b-thinking-nvfp4-1.0
```
---
## Quantization Details
| Parameter | Value |
|-----------|-------|
| **Quantization Format** | NVFP4 (FP4 E2M1) |
| **Weight Precision** | 4-bit (W4) |
| **Activation Precision** | 16-bit (A16) |
| **Block Size** | 16 elements |
| **Scale Format** | FP8 E4M3 |
| **Calibration Dataset** | CNN/DailyMail (512 samples) |
| **Calibration Method** | AWQ-style |
| **Tool Used** | NVIDIA TensorRT-Model-Optimizer |
---
## Hardware Requirements
| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| **GPU** | RTX 3070 (8GB) | RTX 4090 / DGX Spark |
| **GPU Memory** | 8 GB | 24 GB+ |
| **CUDA** | 12.4+ | 13.0 |
| **Driver** | 560+ | 570+ |
---
## Model Architecture
Qwen3-VL-4B-Thinking features:
- **Vision-Language**: Processes both images and text inputs
- **Enhanced Reasoning**: Optimized for step-by-step thinking and complex reasoning
- **Extended Context**: 32K native, 262K extended context length
- **Multilingual**: Strong performance in English and Chinese
---
## Links
| Resource | Link |
|----------|------|
| **Original Model** | [Qwen/Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking) |
| **Docker (Org)** | [elkaioptimization/vllm-nvfp4-cuda-13](https://hub.docker.com/r/elkaioptimization/vllm-nvfp4-cuda-13) |
| **Docker (Personal)** | [mutazai/vllm-spark-blackwell-nvfp4-optimized](https://hub.docker.com/r/mutazai/vllm-spark-blackwell-nvfp4-optimized) |
| **Author** | [Mutaz Al Awamleh](https://www.linkedin.com/in/mutazalawamleh/) |
| **Organization** | [ELK-AI](https://elkai.ai) |
---
## License
This model is released under the Apache 2.0 License, same as the original Qwen3 model.
---
**Built by Mutaz Al Awamleh | ELK-AI**
*First to quantize Qwen3-VL-4B-Thinking to NVFP4 for Blackwell*