---
license: other
license_name: nsclv1
license_link: https://huggingface.co/nvidia/NV-Reason-CXR-3B/blob/main/LICENSE
base_model: nvidia/NV-Reason-CXR-3B
tags:
- medical
- x-ray
- vision-language
- gguf
- quantized
- mobile
- cxr
- radiology
- qwen2.5-vl
- llama.cpp
- cactus-compute
model_type: qwen2vl
language:
- en
pipeline_tag: image-text-to-text
---

# NV-Reason-CXR-3B GGUF (Quantized for Edge)

Quantized GGUF versions of NVIDIA's [NV-Reason-CXR-3B](https://huggingface.co/nvidia/NV-Reason-CXR-3B) vision-language model, optimized for edge deployment with [Cactus Compute](https://github.com/cactus-compute/cactus) and [llama.cpp](https://github.com/ggerganov/llama.cpp).

## Model Description

This repository contains quantized versions of NV-Reason-CXR-3B, a 3B parameter vision-language model specialized in chest X-ray analysis. The model has been converted to GGUF format and quantized for efficient deployment on edge devices (mobile, desktop, embedded systems).

- **Original Model:** [nvidia/NV-Reason-CXR-3B](https://huggingface.co/nvidia/NV-Reason-CXR-3B)
- **Base Architecture:** Qwen2.5-VL 3B Instruct
- **Conversion:** llama.cpp
- **Quantization:** llama-cpp-python

## Available Models

| Filename | Format | Size | Use Case | Quality | Speed |
|----------|--------|------|----------|---------|-------|
| `nv-reason-cxr-3b-fp16.gguf` | FP16 | 6.3 GB | Desktop with GPU (quality reference) | 100% | Baseline |
| `nv-reason-cxr-3b-Q4_K_M.gguf` | Q4_K_M | 1.96 GB | **Recommended for edge devices** | 90-95% | Fast |
| `mmproj-nv-reason-cxr-3b-f16.gguf` | FP16 mmproj | 1.25 GB | **Vision encoder (required for image analysis)** | 100% | - |

### Model Details

**Q4_K_M (Recommended):**
- **Size:** 1.96 GB (69% reduction from FP16)
- **Compression:** 3.23x from original
- **Quality:** 90-95% retention
- **Speed:** 8-20 tokens/sec on mobile (device-dependent)
- **RAM Required:** 3-4 GB
- **Best for:** Mid-range to high-end mobile devices

**FP16 (Reference):**
- **Size:** 6.3 GB
- **Quality:** Original precision
- **Speed:** Slower than quantized
- **RAM Required:** 8+ GB
- **Best for:** Desktop inference, quality comparison

## Performance Benchmarks

### Desktop (Apple M3 Mac)

**Q4_K_M Performance:**

| Configuration | Load Time | Inference Speed | Memory Usage |
|--------------|-----------|-----------------|--------------|
| CPU-only | 1.87s | 29.61 tok/s | ~2 GB RAM |
| **M3 GPU (Metal)** | **0.34s** | **33.24 tok/s** | ~2 GB RAM |
| **Speedup** | **5.46x faster** ⚡ | 1.12x faster | Same |

**Key Insights:**
- 🚀 **GPU provides 5.46x faster model loading** - Huge benefit for app cold starts!
- ⚡ Modest 1.12x inference speedup - Q4_K_M is already highly CPU-optimized
- ✅ **Excellent CPU performance** - GPU acceleration is optional, not required
- 💪 Mobile devices are expected to run well even without a dedicated GPU

**Test Hardware:** Apple M3 MacBook Pro (Metal GPU support)

### Mobile Projections

| Device | RAM | Expected Speed | Load Time | Rating |
|--------|-----|----------------|-----------|--------|
| Budget Android | 3GB | 3-5 tok/s | 30-45s | Poor |
| Mid-range Android | 4GB | 8-12 tok/s | 20-30s | Good |
| High-end Android | 6GB | 15-20 tok/s | 15-25s | Excellent |
| iPhone 12+ | 4-6GB | 12-18 tok/s | 15-20s | Excellent |
| iPhone 14+ | 6GB+ | 18-25 tok/s | 10-15s | Optimal |

**Minimum Requirements:**
- 4GB RAM
- 3GB free storage
- iOS 14+ or Android 8+

## Usage

### With llama.cpp

```bash
# Clone and build llama.cpp (current versions build with CMake rather than make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download the model and the vision encoder (both are needed for image input)
huggingface-cli download samwell/NV-Reason-CXR-3B-GGUF \
  nv-reason-cxr-3b-Q4_K_M.gguf \
  mmproj-nv-reason-cxr-3b-f16.gguf \
  --local-dir ./models

# Run inference with the multimodal CLI
./build/bin/llama-mtmd-cli \
  -m models/nv-reason-cxr-3b-Q4_K_M.gguf \
  --mmproj models/mmproj-nv-reason-cxr-3b-f16.gguf \
  -p "Analyze this chest X-ray image." \
  --image xray.jpg \
  -n 512 \
  --temp 0.3
```

### With llama-cpp-python

```python
from llama_cpp import Llama

# Pick ONE of the two options below; the second assignment replaces the first.

# Option 1: CPU-only (works great, 29.61 tok/s on M3)
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    n_gpu_layers=0,  # CPU-only
)

# Option 2: GPU acceleration (5.46x faster loading!)
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    n_gpu_layers=-1,  # Use GPU (Metal on Mac, CUDA on Linux/Windows)
)

# Analyze X-ray
# Note: this call sends a text-only prompt. Passing an actual image also
# requires the mmproj file and a multimodal chat handler.
response = llm(
    "Analyze this chest X-ray image and identify key findings.",
    max_tokens=512,
    temperature=0.3,  # Lower for medical = more deterministic
    top_p=0.9,
)

print(response['choices'][0]['text'])
```

### With Cactus Compute (Flutter/Mobile)

**Note:** You need BOTH the model file AND the mmproj file for image analysis.

```dart
import 'package:cactus/cactus.dart';

// Initialize VLM with both model and mmproj files
final vlm = CactusVLM();
await vlm.init(
  modelFilename: 'nv-reason-cxr-3b-Q4_K_M.gguf',       // Model file
  mmprojFilename: 'mmproj-nv-reason-cxr-3b-f16.gguf',  // Vision encoder
  contextSize: 2048,  // Context window (2K-4K for mobile)
  threads: 4,         // CPU threads
  gpuLayers: 0,       // CPU-only (GPU may cause issues on some devices)
);

// Create prompt
final messages = [
  ChatMessage(
    role: 'system',
    content: 'You are a helpful radiologist assistant.',
  ),
  ChatMessage(
    role: 'user',
    content: 'Describe what you see in this chest X-ray image.',
  ),
];

// Analyze X-ray
final response = await vlm.completion(
  messages,
  imagePaths: ['path/to/xray.jpg'],
  maxTokens: 150,
  temperature: 0.1,  // Lower for medical analysis (0.1-0.5)
);

print(response.text);
```

**Mobile GPU Benefits:**
- 🚀 5.46x faster model loading on the M3 benchmark above (critical for app startup); mobile gains are projected, not measured
- 📱 Better user experience on iOS (Metal) and Android (Vulkan/OpenCL)
- 🔋 Minimal battery impact during loading phase
- ✅ Falls back gracefully to CPU if GPU unavailable

## Inference Parameters

**Recommended settings for medical analysis:**

```python
{
"temperature": 0.3, # Lower = more deterministic (range: 0.1-0.5) "top_p": 0.9, # Nucleus sampling "top_k": 40, # Top-k sampling "repeat_penalty": 1.1, # Avoid repetition "max_tokens": 512, # Response length "n_ctx": 4096, # Context window (2048-4096 for mobile) } ``` ## Files Included ``` . ├── README.md # Model card and usage guide ├── LICENSE # NSCLV1 license ├── CONVERSION_PROCESS.md # Technical conversion details ├── nv-reason-cxr-3b-Q4_K_M.gguf # Q4_K_M quantized (1.96 GB) - Recommended ├── nv-reason-cxr-3b-fp16.gguf # FP16 reference (6.3 GB) └── mmproj-nv-reason-cxr-3b-f16.gguf # Vision encoder (1.25 GB) - Required ``` ## Model Card ### Model Details - **Developed by:** NVIDIA (original), quantized by samwell - **Model type:** Vision-Language Model (VLM) - **Architecture:** Qwen2.5-VL - **Parameters:** 3 billion - **Language:** English - **License:** NSCLV1 (see LICENSE) - **Fine-tuned from:** Qwen2.5-VL-3B-Instruct - **Specialty:** Chest X-ray analysis ### Intended Use **Primary Use Cases:** - Research in medical image analysis - Educational purposes for radiology students - Prototyping mobile medical AI applications - Edge deployment of medical VLMs **Out-of-Scope:** - ❌ Clinical diagnosis or treatment decisions - ❌ Production medical applications without proper validation - ❌ Replacing trained radiologists - ❌ Any FDA-regulated medical use ### Limitations 1. **Not for Clinical Use:** This model is for research and educational purposes only 2. **Quality Trade-off:** Quantization reduces model size but may affect accuracy 3. **Domain Specific:** Trained primarily on chest X-rays, may not generalize to other imaging 4. **Requires Validation:** All outputs should be verified by medical professionals 5. 
5. **Mobile Performance:** Speed varies significantly by device capabilities

### Ethical Considerations

- Model outputs should not be used for medical diagnosis
- Always consult qualified healthcare professionals
- Be aware of potential biases in training data
- Ensure patient privacy when using with real medical images
- Comply with local healthcare regulations (HIPAA, GDPR, etc.)

### Bias and Fairness

The original model may have inherited biases from training data. Users should:
- Test across diverse patient populations
- Validate performance on their specific use cases
- Monitor for unexpected outputs or biases
- Not rely solely on model outputs

## Technical Details

### Quantization Method

**Q4_K_M** is a 4-bit k-quant format from llama.cpp (k-quants do not use k-means clustering):
- Most weights are stored in roughly 4 bits instead of 16 (FP16)
- Weights are quantized block-wise, with blocks grouped into super-blocks that share quantized scale parameters
- Medium ("M") variant balances size and quality by keeping some tensors at higher precision
- Per-block scales for better accuracy preservation

### Vision Encoder

The vision encoder has been extracted into a separate `mmproj` file for compatibility:

- **File:** `mmproj-nv-reason-cxr-3b-f16.gguf` (1.25 GB)
- **Required:** Both the model file AND mmproj file are needed for image analysis
- **Format:** FP16 (unquantized vision encoder)
- **Extracted from:** NVIDIA's NV-Reason-CXR-3B original model
- **Contains:** Vision transformer blocks and multimodal projection layers (519 tensors)

**Why separate mmproj?**
- Mobile frameworks (Cactus Compute) require separate mmproj architecture
- Allows independent caching and loading strategies
- Enables mixing different model quantizations with same vision encoder

### Context Window

- **Training:** 128,000 tokens
- **Recommended for mobile:** 2,048-4,096 tokens
- **Desktop:** Up to 128,000 tokens (RAM-dependent)

## Citation

If you use this model, please cite the original work:

```bibtex
@misc{nvidia2024nvreasoncrx,
  title={NV-Reason-CXR-3B: A Specialized Vision-Language Model for Chest X-ray Analysis},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/NV-Reason-CXR-3B}
}
```

And optionally cite the quantization:

```bibtex
@misc{nvreasoncrx3b-gguf,
  title={NV-Reason-CXR-3B GGUF: Quantized for Edge Deployment},
  author={samwell},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/samwell/NV-Reason-CXR-3B-GGUF}
}
```

## Acknowledgments

- **NVIDIA** for the original NV-Reason-CXR-3B model
- **Qwen Team** for the Qwen2.5-VL architecture
- **llama.cpp contributors** for the GGUF format and conversion tools
- **Cactus Compute** for mobile VLM deployment framework

## License

This model inherits the NSCLV1 license from the original [NV-Reason-CXR-3B](https://huggingface.co/nvidia/NV-Reason-CXR-3B) model. See [LICENSE](LICENSE) for details.

**Key points:**
- Research and educational use permitted
- Commercial use may require additional permissions
- Not for clinical/diagnostic use
- See original model card for complete license terms

## Disclaimer

⚠️ **IMPORTANT MEDICAL DISCLAIMER**

This model is provided for **RESEARCH AND EDUCATIONAL PURPOSES ONLY**. It is:
- **NOT** intended for clinical diagnosis or treatment
- **NOT** FDA approved or clinically validated
- **NOT** a substitute for professional medical advice
- **NOT** validated for production medical use

Always consult qualified healthcare professionals for medical decisions. The creators and distributors of this model assume no liability for any use of this software.

## Contact & Support

- **Issues:** Report issues on GitHub (link to your repo)
- **Questions:** See documentation in this repository
- **Original Model:** [nvidia/NV-Reason-CXR-3B](https://huggingface.co/nvidia/NV-Reason-CXR-3B)
- **Cactus Compute:** [GitHub](https://github.com/cactus-compute/cactus)

## Version History

- **v1.0** (2025-11-05): Initial release
  - FP16 GGUF conversion
  - Q4_K_M quantization
  - Tested on macOS; mobile figures are projections
  - Complete documentation and scripts

---

**For research and educational purposes only.
Not for clinical use.**
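
The size and speed ratios quoted in this card can be recomputed from the table values. The snippet below is a minimal sketch of that arithmetic. The helper names (`size_reduction_pct`, `ratio`) are illustrative assumptions, not part of any tool used for this release, and the inputs are the rounded figures from the tables above.

```python
# Rounded figures copied from the tables in this card.
FP16_GB = 6.3          # FP16 reference file size
Q4_K_M_GB = 1.96       # Q4_K_M file size
CPU_LOAD_S = 1.87      # M3 CPU-only load time
GPU_LOAD_S = 0.34      # M3 Metal load time
CPU_TOK_S = 29.61      # M3 CPU-only inference speed
GPU_TOK_S = 33.24      # M3 Metal inference speed


def size_reduction_pct(original_gb: float, quantized_gb: float) -> float:
    """Percentage of the original size removed by quantization."""
    return (1 - quantized_gb / original_gb) * 100


def ratio(numerator: float, denominator: float) -> float:
    """Plain ratio, used for compression and speedup factors."""
    return numerator / denominator


reduction = size_reduction_pct(FP16_GB, Q4_K_M_GB)
compression = ratio(FP16_GB, Q4_K_M_GB)
load_speedup = ratio(CPU_LOAD_S, GPU_LOAD_S)        # lower time is better
inference_speedup = ratio(GPU_TOK_S, CPU_TOK_S)     # higher tok/s is better

print(f"Size reduction:    {reduction:.1f}%")
print(f"Compression:       {compression:.2f}x")
print(f"Load speedup:      {load_speedup:.2f}x")
print(f"Inference speedup: {inference_speedup:.2f}x")
```

With the rounded inputs this gives roughly a 69% size reduction, a 3.21x compression ratio, a 5.50x load-time speedup, and a 1.12x inference speedup. The small gaps from the quoted 3.23x and 5.46x are presumably because those figures were computed from unrounded measurements.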