---
license: other
license_name: nsclv1
license_link: https://huggingface.co/nvidia/NV-Reason-CXR-3B/blob/main/LICENSE
base_model: nvidia/NV-Reason-CXR-3B
tags:
  - medical
  - x-ray
  - vision-language
  - gguf
  - quantized
  - mobile
  - cxr
  - radiology
  - qwen2.5-vl
  - llama.cpp
  - cactus-compute
model_type: qwen2vl
language:
  - en
pipeline_tag: image-text-to-text
---

# NV-Reason-CXR-3B GGUF (Quantized for Edge)

Quantized GGUF versions of NVIDIA's [NV-Reason-CXR-3B](https://huggingface.co/nvidia/NV-Reason-CXR-3B) vision-language model optimized for edge deployment for [Cactus Compute](https://github.com/cactus-compute/cactus) and [llama.cpp](https://github.com/ggerganov/llama.cpp).

## Model Description

This repository contains quantized versions of NV-Reason-CXR-3B, a 3B parameter vision-language model specialized in chest X-ray analysis. The model has been converted to GGUF format and quantized for efficient deployment on edge devices (mobile, desktop, embedded systems).

**Original Model:** [nvidia/NV-Reason-CXR-3B](https://huggingface.co/nvidia/NV-Reason-CXR-3B)
**Base Architecture:** Qwen2.5-VL 3B Instruct
**Conversion:** llama.cpp
**Quantization:** llama-cpp-python

## Available Models

| Filename | Format | Size | Use Case | Quality | Speed |
|----------|--------|------|----------|---------|-------|
| `nv-reason-cxr-3b-fp16.gguf` | FP16 | 6.3 GB | Desktop with GPU (quality reference) | 100% | Baseline |
| `nv-reason-cxr-3b-Q4_K_M.gguf` | Q4_K_M | 1.96 GB | **Recommended for edge devices** | 90-95% | Fast |
| `mmproj-nv-reason-cxr-3b-f16.gguf` | FP16 mmproj | 1.25 GB | **Vision encoder (required for image analysis)** | 100% | - |

### Model Details

**Q4_K_M (Recommended):**
- **Size:** 1.96 GB (69% reduction from FP16)
- **Compression:** 3.23x from original
- **Quality:** 90-95% retention
- **Speed:** 8-20 tokens/sec on mobile (device-dependent)
- **RAM Required:** 3-4 GB
- **Best for:** Mid-range to high-end mobile devices

**FP16 (Reference):**
- **Size:** 6.3 GB
- **Quality:** Original precision
- **Speed:** Slower than quantized
- **RAM Required:** 8+ GB
- **Best for:** Desktop inference, quality comparison

## Performance Benchmarks

### Desktop (Apple M3 Mac)

**Q4_K_M Performance:**

| Configuration | Load Time | Inference Speed | Memory Usage |
|--------------|-----------|-----------------|--------------|
| CPU-only | 1.87s | 29.61 tok/s | ~2 GB RAM |
| **M3 GPU (Metal)** | **0.34s** | **33.24 tok/s** | ~2 GB RAM |
| **Speedup** | **5.46x faster** ⚡ | 1.12x faster | Same |

**Key Insights:**
- 🚀 **GPU provides 5.46x faster model loading** - Huge benefit for app cold starts!
- ⚡ Modest 1.12x inference speedup - Q4_K_M is already highly CPU-optimized
- ✅ **Excellent CPU performance** - GPU acceleration is optional, not required
- 💪 Mobile devices will run well even without dedicated GPU

**Test Hardware:** Apple M3 MacBook Pro (Metal GPU support)

### Mobile Projections

| Device | RAM | Expected Speed | Load Time | Rating |
|--------|-----|----------------|-----------|--------|
| Budget Android | 3GB | 3-5 tok/s | 30-45s | Poor |
| Mid-range Android | 4GB | 8-12 tok/s | 20-30s | Good |
| High-end Android | 6GB | 15-20 tok/s | 15-25s | Excellent |
| iPhone 12+ | 4-6GB | 12-18 tok/s | 15-20s | Excellent |
| iPhone 14+ | 6GB+ | 18-25 tok/s | 10-15s | Optimal |

**Minimum Requirements:**
- 4GB RAM
- 3GB free storage
- iOS 14+ or Android 8+

## Usage

### With llama.cpp

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Download model
huggingface-cli download samwell/NV-Reason-CXR-3B-GGUF \
  nv-reason-cxr-3b-Q4_K_M.gguf --local-dir ./models

# Run inference
./llama-cli \
  -m models/nv-reason-cxr-3b-Q4_K_M.gguf \
  -p "Analyze this chest X-ray image." \
  --image xray.jpg \
  -n 512 \
  --temp 0.3
```

### With llama-cpp-python

```python
from llama_cpp import Llama

# Option 1: CPU-only (works great, 29.61 tok/s on M3)
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    n_gpu_layers=0,  # CPU-only
)

# Option 2: GPU acceleration (5.46x faster loading!)
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    n_gpu_layers=-1,  # Use GPU (Metal on Mac, CUDA on Linux/Windows)
)

# Analyze X-ray
response = llm(
    "Analyze this chest X-ray image and identify key findings.",
    max_tokens=512,
    temperature=0.3,  # Lower for medical = more deterministic
    top_p=0.9,
)

print(response['choices'][0]['text'])
```

### With Cactus Compute (Flutter/Mobile)

**Note:** You need BOTH the model file AND the mmproj file for image analysis.

```dart
import 'package:cactus/cactus.dart';

// Initialize VLM with both model and mmproj files
final vlm = CactusVLM();
await vlm.init(
  modelFilename: 'nv-reason-cxr-3b-Q4_K_M.gguf',      // Model file
  mmprojFilename: 'mmproj-nv-reason-cxr-3b-f16.gguf', // Vision encoder
  contextSize: 2048,    // Context window (2K-4K for mobile)
  threads: 4,           // CPU threads
  gpuLayers: 0,         // CPU-only (GPU may cause issues on some devices)
);

// Create prompt
final messages = [
  ChatMessage(
    role: 'system',
    content: 'You are a helpful radiologist assistant.',
  ),
  ChatMessage(
    role: 'user',
    content: 'Describe what you see in this chest X-ray image.',
  ),
];

// Analyze X-ray
final response = await vlm.completion(
  messages,
  imagePaths: ['path/to/xray.jpg'],
  maxTokens: 150,
  temperature: 0.1,     // Lower for medical analysis (0.1-0.5)
);

print(response.text);
```

**Mobile GPU Benefits:**
- 🚀 5.46x faster model loading (critical for app startup)
- 📱 Better user experience on iOS (Metal) and Android (Vulkan/OpenCL)
- 🔋 Minimal battery impact during loading phase
- ✅ Falls back gracefully to CPU if GPU unavailable

## Inference Parameters

**Recommended settings for medical analysis:**

```python
{
    "temperature": 0.3,      # Lower = more deterministic (range: 0.1-0.5)
    "top_p": 0.9,            # Nucleus sampling
    "top_k": 40,             # Top-k sampling
    "repeat_penalty": 1.1,   # Avoid repetition
    "max_tokens": 512,       # Response length
    "n_ctx": 4096,          # Context window (2048-4096 for mobile)
}
```

## Files Included

```
.
├── README.md                               # Model card and usage guide
├── LICENSE                                 # NSCLV1 license
├── CONVERSION_PROCESS.md                   # Technical conversion details
├── nv-reason-cxr-3b-Q4_K_M.gguf           # Q4_K_M quantized (1.96 GB) - Recommended
├── nv-reason-cxr-3b-fp16.gguf             # FP16 reference (6.3 GB)
└── mmproj-nv-reason-cxr-3b-f16.gguf       # Vision encoder (1.25 GB) - Required
```

## Model Card

### Model Details

- **Developed by:** NVIDIA (original), quantized by samwell
- **Model type:** Vision-Language Model (VLM)
- **Architecture:** Qwen2.5-VL
- **Parameters:** 3 billion
- **Language:** English
- **License:** NSCLV1 (see LICENSE)
- **Fine-tuned from:** Qwen2.5-VL-3B-Instruct
- **Specialty:** Chest X-ray analysis

### Intended Use

**Primary Use Cases:**
- Research in medical image analysis
- Educational purposes for radiology students
- Prototyping mobile medical AI applications
- Edge deployment of medical VLMs

**Out-of-Scope:**
- ❌ Clinical diagnosis or treatment decisions
- ❌ Production medical applications without proper validation
- ❌ Replacing trained radiologists
- ❌ Any FDA-regulated medical use

### Limitations

1. **Not for Clinical Use:** This model is for research and educational purposes only
2. **Quality Trade-off:** Quantization reduces model size but may affect accuracy
3. **Domain Specific:** Trained primarily on chest X-rays, may not generalize to other imaging
4. **Requires Validation:** All outputs should be verified by medical professionals
5. **Mobile Performance:** Speed varies significantly by device capabilities

### Ethical Considerations

- Model outputs should not be used for medical diagnosis
- Always consult qualified healthcare professionals
- Be aware of potential biases in training data
- Ensure patient privacy when using with real medical images
- Comply with local healthcare regulations (HIPAA, GDPR, etc.)

### Bias and Fairness

The original model may have inherited biases from training data. Users should:
- Test across diverse patient populations
- Validate performance on their specific use cases
- Monitor for unexpected outputs or biases
- Not rely solely on model outputs

## Technical Details

### Quantization Method

**Q4_K_M** uses 4-bit quantization with K-means clustering:
- Weights stored in 4 bits instead of 16 (FP16)
- K-means clustering for optimal quantization scales
- Medium variant balances size and quality
- Per-block scales for better accuracy preservation

### Vision Encoder

The vision encoder has been extracted into a separate `mmproj` file for compatibility:
- **File:** `mmproj-nv-reason-cxr-3b-f16.gguf` (1.25 GB)
- **Required:** Both the model file AND mmproj file are needed for image analysis
- **Format:** FP16 (full precision vision encoder)
- **Extracted from:** NVIDIA's NV-Reason-CXR-3B original model
- **Contains:** Vision transformer blocks and multimodal projection layers (519 tensors)

**Why separate mmproj?**
- Mobile frameworks (Cactus Compute) require separate mmproj architecture
- Allows independent caching and loading strategies
- Enables mixing different model quantizations with same vision encoder

### Context Window

- **Training:** 128,000 tokens
- **Recommended for mobile:** 2,048-4,096 tokens
- **Desktop:** Up to 128,000 tokens (RAM-dependent)

## Citation

If you use this model, please cite the original work:

```bibtex
@misc{nvidia2024nvreasoncrx,
  title={NV-Reason-CXR-3B: A Specialized Vision-Language Model for Chest X-ray Analysis},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/NV-Reason-CXR-3B}
}
```

And optionally cite the quantization:

```bibtex
@misc{nvreasoncrx3b-gguf,
  title={NV-Reason-CXR-3B GGUF: Quantized for Edge Deployment},
  author={samwell},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/samwell/NV-Reason-CXR-3B-GGUF}
}
```

## Acknowledgments

- **NVIDIA** for the original NV-Reason-CXR-3B model
- **Qwen Team** for the Qwen2.5-VL architecture
- **llama.cpp contributors** for the GGUF format and conversion tools
- **Cactus Compute** for mobile VLM deployment framework

## License

This model inherits the NSCLV1 license from the original [NV-Reason-CXR-3B](https://huggingface.co/nvidia/NV-Reason-CXR-3B) model. See [LICENSE](LICENSE) for details.

**Key points:**
- Research and educational use permitted
- Commercial use may require additional permissions
- Not for clinical/diagnostic use
- See original model card for complete license terms

## Disclaimer

⚠️ **IMPORTANT MEDICAL DISCLAIMER**

This model is provided for **RESEARCH AND EDUCATIONAL PURPOSES ONLY**. It is:

- **NOT** intended for clinical diagnosis or treatment
- **NOT** FDA approved or clinically validated
- **NOT** a substitute for professional medical advice
- **NOT** validated for production medical use

Always consult qualified healthcare professionals for medical decisions. The creators and distributors of this model assume no liability for any use of this software.

## Contact & Support

- **Issues:** Report issues on GitHub (link to your repo)
- **Questions:** See documentation in this repository
- **Original Model:** [nvidia/NV-Reason-CXR-3B](https://huggingface.co/nvidia/NV-Reason-CXR-3B)
- **Cactus Compute:** [GitHub](https://github.com/cactus-compute/cactus)

## Version History

- **v1.0** (2025-11-05): Initial release
  - FP16 GGUF conversion
  - Q4_K_M quantization
  - Tested on macOS and mobile projections
  - Complete documentation and scripts

---

**For research and educational purposes only. Not for clinical use.**