--- license: apache-2.0 language: - en - zh library_name: transformers tags: - nvidia - qwen3 - qwen3-vl - nvfp4 - quantized - blackwell - sm121 - elk-ai - vllm - cuda13 - fp4 - vision-language - thinking - reasoning - multimodal base_model: Qwen/Qwen3-VL-4B-Thinking pipeline_tag: image-text-to-text model-index: - name: qwen3-vl-4b-thinking-nvfp4-w4a16 results: [] ---
# Qwen3-VL-4B-Thinking NVFP4 W4A16 ### First NVFP4 Quantization of Qwen3-VL-4B-Thinking **By Mutaz Al Awamleh | [ELK-AI](https://elkai.ai)** [![Docker](https://img.shields.io/badge/Docker-elkaioptimization-blue?logo=docker)](https://hub.docker.com/r/elkaioptimization/vllm-nvfp4-cuda-13) [![HuggingFace](https://img.shields.io/badge/HuggingFace-cybermotaz-yellow?logo=huggingface)](https://huggingface.co/cybermotaz) [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
--- ## Model Description This is the **first publicly available NVFP4 W4A16 quantized** version of **Qwen3-VL-4B-Thinking**, a vision-language model optimized for NVIDIA Blackwell (SM121) architecture. | Attribute | Original | NVFP4 Quantized | |-----------|----------|-----------------| | **Parameters** | 4B | Same | | **Architecture** | Vision-Language + Thinking | Same | | **Model Size** | ~8.3 GB | **~3.5 GB** | | **Memory Savings** | - | **58%** | | **Precision** | BF16 | FP4 W4A16 | --- ## Quick Start ### Using vLLM (Recommended) ```python from vllm import LLM, SamplingParams model = LLM( model="cybermotaz/qwen3-vl-4b-thinking-nvfp4-w4a16", trust_remote_code=True, quantization="modelopt_fp4", kv_cache_dtype="fp8", gpu_memory_utilization=0.95 ) sampling_params = SamplingParams(temperature=0.7, max_tokens=512) prompt = "Think step by step: What is shown in this image?" outputs = model.generate([prompt], sampling_params) print(outputs[0].outputs[0].text) ``` ### Using Docker (Pre-loaded) ```bash # Pull the optimized container docker pull elkaioptimization/vllm-nvfp4-cuda-13:qwen3-vl-4b-thinking-nvfp4-1.0 # Run with OpenAI-compatible API docker run --gpus all -p 8000:8000 \ elkaioptimization/vllm-nvfp4-cuda-13:qwen3-vl-4b-thinking-nvfp4-1.0 ``` --- ## Quantization Details | Parameter | Value | |-----------|-------| | **Quantization Format** | NVFP4 (FP4 E2M1) | | **Weight Precision** | 4-bit (W4) | | **Activation Precision** | 16-bit (A16) | | **Block Size** | 16 elements | | **Scale Format** | FP8 E4M3 | | **Calibration Dataset** | CNN/DailyMail (512 samples) | | **Calibration Method** | AWQ-style | | **Tool Used** | NVIDIA TensorRT-Model-Optimizer | --- ## Hardware Requirements | Requirement | Minimum | Recommended | |-------------|---------|-------------| | **GPU** | RTX 3070 (8GB) | RTX 4090 / DGX Spark | | **GPU Memory** | 8 GB | 24 GB+ | | **CUDA** | 12.4+ | 13.0 | | **Driver** | 560+ | 570+ | --- ## Model Architecture Qwen3-VL-4B-Thinking features: - **Vision-Language**: Processes both images and text inputs - **Enhanced Reasoning**: Optimized for step-by-step thinking and complex reasoning - **Extended Context**: 32K native, 262K extended context length - **Multilingual**: Strong performance in English and Chinese --- ## Links | Resource | Link | |----------|------| | **Original Model** | [Qwen/Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking) | | **Docker (Org)** | [elkaioptimization/vllm-nvfp4-cuda-13](https://hub.docker.com/r/elkaioptimization/vllm-nvfp4-cuda-13) | | **Docker (Personal)** | [mutazai/vllm-spark-blackwell-nvfp4-optimized](https://hub.docker.com/r/mutazai/vllm-spark-blackwell-nvfp4-optimized) | | **Author** | [Mutaz Al Awamleh](https://www.linkedin.com/in/mutazalawamleh/) | | **Organization** | [ELK-AI](https://elkai.ai) | --- ## License This model is released under the Apache 2.0 License, same as the original Qwen3 model. ---
**Built by Mutaz Al Awamleh | ELK-AI** *First to quantize Qwen3-VL-4B-Thinking to NVFP4 for Blackwell*