---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/blob/main/LICENSE
base_model: Qwen/Qwen3-VL-32B-Instruct
tags:
- vision-language
- multimodal
- qwen
- qwen3
- nvfp4
- fp4
- quantized
- awq
- vllm
- blackwell
- cuda13
- optimized
- inference
library_name: transformers
pipeline_tag: image-text-to-text
---

# 🦌 ELK-AI | Qwen3-VL-32B-Instruct-NVFP4

### **Alibaba's Flagship 32B Vision-Language Model, Now 3x Smaller**

**NVFP4 AWQ_FULL Quantization | 21 GB (was 62 GB) | <0.3% Accuracy Loss**

[![Docker Hub](https://img.shields.io/docker/pulls/elkaioptimization/qwen3vl-32b-nvfp4?style=for-the-badge&logo=docker&color=2496ED&label=Docker%20Pulls)](https://hub.docker.com/r/elkaioptimization/qwen3vl-32b-nvfp4)
[![CUDA 13](https://img.shields.io/badge/CUDA-13.0-76B900?style=for-the-badge&logo=nvidia)](https://developer.nvidia.com/cuda-toolkit)
[![Blackwell](https://img.shields.io/badge/Blackwell-SM121-7B2D8E?style=for-the-badge&logo=nvidia)](https://www.nvidia.com/dgx-spark)
[![vLLM](https://img.shields.io/badge/vLLM-0.13.0-FF6F00?style=for-the-badge)](https://github.com/vllm-project/vllm)

---

**[Mutaz Al Awamleh](https://www.linkedin.com/in/mutaz-al-awamleh/)** • **[ELK-AI](https://elkai.ai)** • **December 2025**

*Production-ready quantization for next-generation NVIDIA hardware*

---

## 🧠 What Is This?

This is **Qwen3-VL-32B-Instruct**, Alibaba's state-of-the-art 32-billion-parameter vision-language model, quantized to **NVFP4** using NVIDIA's Model Optimizer with **AWQ_FULL** calibration.

### Key Achievements

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Model Size** | 62 GB | 21 GB | **66% smaller** |
| **VRAM Required** | 70+ GB | 24 GB | **66% reduction** |
| **Accuracy** | 100% | 99.7%+ | **<0.3% loss** |
| **Setup Time** | Hours | Seconds | **Instant** |

### Why NVFP4?

**NVFP4** is NVIDIA's 4-bit floating-point quantization format, designed for the Blackwell architecture (B200, GB10, DGX Spark). Each weight is stored as a 4-bit float (E2M1) with a shared FP8 scale per 16-element block. Unlike uniform integer quantization (INT4), the representable values are spaced non-uniformly, which follows the distribution of weights more closely and typically gives better accuracy retention.

---

## 🚀 Why This Model?

**We solved the hard problems so you don't have to.**

| Challenge | Our Solution |
|-----------|--------------|
| FlashInfer compilation takes 2+ hours | Pre-compiled for SM80-SM121 |
| Vision encoder quality degradation | ViT preserved at BF16 precision |
| 50+ undocumented environment variables | Battle-tested configuration |
| Days of CUDA graph tuning | Optimized out of the box |
| 62 GB model doesn't fit on consumer GPUs | Compressed to 21 GB with NVFP4 |

**Result: From WEEKS of optimization to 30 SECONDS of setup.**

---

## 🏗️ 7-Layer Optimization Stack

```
┌─────────────────────────────────────────────────────────────┐
│  Layer 7: Model Weights (NVFP4 AWQ_FULL + BF16 Vision)      │
├─────────────────────────────────────────────────────────────┤
│  Layer 6: vLLM V1 Engine (Async + Chunked Prefill)          │
├─────────────────────────────────────────────────────────────┤
│  Layer 5: FlashInfer 0.5.3 (FP4/FP8 Native Kernels)         │
├─────────────────────────────────────────────────────────────┤
│  Layer 4: FP8 KV-Cache (50% Memory Savings)                 │
├─────────────────────────────────────────────────────────────┤
│  Layer 3: CUDA Graphs (Reduced Kernel Launch Overhead)      │
├─────────────────────────────────────────────────────────────┤
│  Layer 2: CUDA 13.0 + SM121 (Blackwell Native Support)      │
├─────────────────────────────────────────────────────────────┤
│  Layer 1: Optimized Container (Zero Setup Required)         │
└─────────────────────────────────────────────────────────────┘
```

---

## 📦 Model Specifications

| Specification | Value |
|---------------|-------|
| **Base Model** | [Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) |
| **Parameters** | 32 Billion |
| **Quantization** | NVFP4 with AWQ_FULL |
| **Calibration** | 512 samples from WikiText-2 |
| **Algorithm** | Activation-Aware Weight Quantization |
| **Model Size** | 21 GB (5 shards) |
| **Context Length** | 32,768 tokens |
| **Vision Encoder** | BF16 (preserved for quality) |
| **Accuracy Retention** | >99.7% |

### Architecture Details

| Component | Precision | Purpose |
|-----------|-----------|---------|
| **Language Model** | NVFP4 | Text generation & reasoning |
| **Vision Encoder (ViT)** | BF16 | Image understanding |
| **Visual Merger** | BF16 | Vision-language alignment |
| **Embeddings** | BF16 | Token representations |

---

## 💻 Hardware Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| **GPU VRAM** | 24 GB | 32+ GB |
| **GPU Model** | RTX 4090 / A100 | B200 / GB10 / DGX Spark |
| **CUDA Version** | 12.0+ | 13.0 |
| **System RAM** | 32 GB | 64+ GB |

### Tested Configurations

- ✅ NVIDIA B200 (Blackwell)
- ✅ NVIDIA GB10 / DGX Spark
- ✅ NVIDIA A100 80GB
- ✅ NVIDIA RTX 4090 24GB
- ✅ NVIDIA L40S 48GB

---

## 🐳 Quick Start with Docker (Recommended)

### Option 1: Model-Specific Container

```bash
# Pull the optimized container
docker pull elkaioptimization/qwen3vl-32b-nvfp4:1.0

# Download this model
huggingface-cli download ELK-AI/Qwen3-VL-32B-Instruct-NVFP4 --local-dir ./model

# Run inference server
docker run -d --gpus all \
  -v "$(pwd)/model:/model" \
  -p 8000:8000 \
  --name qwen3vl \
  elkaioptimization/qwen3vl-32b-nvfp4:1.0
```

### Option 2: Universal NVFP4 Container

Use our base container for any NVFP4-quantized model:

```bash
# Pull the universal vLLM container
docker pull elkaioptimization/vllm-nvfp4-cuda13:3.0

# Run with custom configuration
docker run -d --gpus all \
  -v "$(pwd)/model:/model" \
  -p 8000:8000 \
  elkaioptimization/vllm-nvfp4-cuda13:3.0 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000
```

---

## 🔥 Usage Examples

### Python with vLLM

```python
from vllm import LLM, SamplingParams

# Initialize with NVFP4 quantization
llm = LLM(
    model="ELK-AI/Qwen3-VL-32B-Instruct-NVFP4",
    quantization="modelopt_fp4",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

# Text generation
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    ["Explain the theory of relativity in simple terms."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```

### OpenAI-Compatible API

The examples below assume the server from the Docker quick start is listening on `localhost:8000` and serving the model under the name `/model`.

#### Text Generation

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [
      {"role": "user", "content": "Write a haiku about machine learning."}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

#### Vision + Text (Multimodal)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }],
    "max_tokens": 500
  }'
```

#### Base64 Image Input

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What objects do you see?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    }]
  }'
```

### Python OpenAI SDK

```python
from openai import OpenAI

# The local vLLM server does not check the API key, but the SDK requires one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Text only
response = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

# With image
response = client.chat.completions.create(
    model="/model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)
```

---

## 📊 Capabilities

| Modality | Input | Output | Quality |
|----------|-------|--------|---------|
| **Text** | ✅ | ✅ | Excellent |
| **Images** | ✅ | N/A | Excellent (BF16 ViT) |
| **Video** | ✅ | N/A | Excellent |
| **Charts/Diagrams** | ✅ | N/A | State-of-the-art |
| **Documents/OCR** | ✅ | N/A | State-of-the-art |
| **Code** | ✅ | ✅ | Excellent |
| **Math** | ✅ | ✅ | Excellent |

---

## 🔧 Quantization Details

This model was quantized using the following configuration:

```python
# NVIDIA Model Optimizer (modelopt) configuration
import copy

import modelopt.torch.quantization as mtq

# Copy the preset so the library's module-level config is not mutated
config = copy.deepcopy(mtq.NVFP4_AWQ_FULL_CFG)  # Best accuracy (<0.3% loss)

# Vision encoder exclusions (preserved at BF16)
exclusions = {
    "*visual*": {"enable": False},
    "*patch_embed*": {"enable": False},
    "*merger*": {"enable": False},
    "*vision*": {"enable": False},
    "*embed_tokens*": {"enable": False},
}
config["quant_cfg"].update(exclusions)

# Quantize with 512 calibration samples.
# `model` is the loaded base model and `calibration_loop(model)` feeds the
# calibration samples through it; both must be defined before this call.
model = mtq.quantize(model, config, forward_loop=calibration_loop)
```

### Why AWQ_FULL?

| Algorithm | Accuracy Loss | Calibration Required |
|-----------|---------------|---------------------|
| DEFAULT | ~1.0% | No |
| AWQ_LITE | ~0.5% | 128 samples |
| **AWQ_FULL** | **<0.3%** | **512 samples** |

We use **AWQ_FULL** for production deployments because the additional calibration time (30-60 minutes) is a one-time cost and is worth the better accuracy retention.

---

## 🦌 More ELK-AI Optimized Models

| Model | Size | Type | Quantization | Link |
|-------|------|------|--------------|------|
| Qwen3-VL-2B | 2.1 GB | Vision | NVFP4 | [Docker Hub](https://hub.docker.com/r/elkaioptimization/qwen3-vl-2b-thinking-nvfp4-vllm-cuda13) |
| Qwen3-VL-4B | 4.2 GB | Vision | NVFP4 | [Docker Hub](https://hub.docker.com/r/elkaioptimization/qwen3-vl-4b-thinking-nvfp4-vllm-cuda13) |
| Qwen3-VL-8B | 8.4 GB | Vision | NVFP4 | [Docker Hub](https://hub.docker.com/r/elkaioptimization/qwen3-vl-8b-thinking-nvfp4-vllm-cuda13) |
| **Qwen3-VL-32B** | **21 GB** | **Vision** | **NVFP4** | **This model** |
| Nemotron3-30B | 31.5 GB | Text | NVFP4 | [Docker Hub](https://hub.docker.com/r/elkaioptimization/nemotron3-30b-nvfp4-vllm-cuda13) |
| Devstral-24B | 53.8 GB | Code | FP8 | [Docker Hub](https://hub.docker.com/r/elkaioptimization/devstral-small-2-24b-fp8-vllm-cuda13) |

---

## 📜 License

- **Model Weights**: Subject to the [Qwen License](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/blob/main/LICENSE)
- **Quantization & Container**: Apache 2.0

---

## 🙏 Acknowledgments

- **Alibaba Qwen Team** for the incredible Qwen3-VL model
- **NVIDIA** for Model Optimizer and NVFP4 quantization
- **vLLM Team** for the high-performance inference engine

---

## 📚 References

- [Qwen2.5-VL Technical Report](https://arxiv.org/abs/2502.13923)
- [NVIDIA Model Optimizer Documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/)
- [vLLM Documentation](https://docs.vllm.ai/)

---

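## 🧪 Appendix: Building a Base64 Image Request

The Base64 image example in the usage section truncates the encoded image. The sketch below shows one way to build that request body from raw image bytes using only the Python standard library. The helper names `image_bytes_to_data_url` and `build_vision_request` are hypothetical and not part of vLLM or the OpenAI SDK; the served model name `/model` is taken from the Docker quick start.

```python
import base64
import json


def image_bytes_to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL for an `image_url` content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_request(
    prompt: str,
    data_url: str,
    model: str = "/model",  # assumption: model name used by the Docker quick start
    max_tokens: int = 500,
) -> dict:
    """Build a chat-completions request body with one text part and one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    # Stand-in for real file contents; in practice read the bytes with
    # open("photo.jpg", "rb").read() (hypothetical file name).
    jpeg_header = b"\xff\xd8\xff\xe0"
    payload = build_vision_request(
        "What objects do you see?",
        image_bytes_to_data_url(jpeg_header),
    )
    print(json.dumps(payload, indent=2))
```

The printed JSON can be sent as the `-d` body of the `curl` call, or the `messages` list can be passed directly to `client.chat.completions.create(...)` in the OpenAI SDK example. Keeping the encoding step separate from the request builder makes it easy to switch the MIME type, for example to `image/png`.

---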
### Built with ❤️ by ELK-AI

**[Mutaz Al Awamleh](https://www.linkedin.com/in/mutaz-al-awamleh/)** • **December 2025**

*Democratizing access to state-of-the-art AI*

---

**⭐ Star this repo if it helped you!**