---
license: apache-2.0
base_model: atbender/Qwen3.6-VL-REAP-26B-A3B
library_name: gguf
pipeline_tag: image-text-to-text
tags:
  - gguf
  - quantization
  - llama-cpp
  - qwen3.6
  - qwen3-vl
  - vision-language
  - multimodal
  - mixture-of-experts
  - reap
  - pruned
---

# Qwen3.6-VL-REAP-26B-A3B — GGUF

GGUF quantizations of [atbender/Qwen3.6-VL-REAP-26B-A3B](https://huggingface.co/atbender/Qwen3.6-VL-REAP-26B-A3B), a REAP-pruned variant of Qwen3.6-VL. **Both the language model (text quants) and the vision tower (mmproj) are included** - drop the mmproj alongside any text quant for full multimodal (image + text) inference.

## Files

| File | Quant | Size |
| --- | --- | ---: |
| `Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf` | Q4_K_M | ~15 GB |
| `Qwen3.6-VL-REAP-26B-A3B-text-IQ4_XS.gguf` | IQ4_XS | ~14 GB |
| `Qwen3.6-VL-REAP-26B-A3B-text-Q3_K_S.gguf` | Q3_K_S | ~11 GB |
| `mmproj-REAP-26B-F16.gguf` | F16 (vision tower) | ~860 MB |

## Quality vs bf16 (wikitext-2-raw, llama.cpp perplexity)

Each text quant was scored against the bf16 reference using `llama-perplexity` on the wikitext-2-raw test split (580 chunks, n_ctx=512, ~297k tokens). Bench run on a single RTX PRO 6000 (Blackwell, 96 GB).

| Quant | PPL | ΔPPL vs bf16 | Mean KLD | Top-1 token agree |
| --- | ---: | ---: | ---: | ---: |
| **bf16** (reference) | **9.2369** | — | 0 | 100% |
| Q4_K_M | 9.3858 | **+1.62%** | **0.0449** | **90.41%** |
| IQ4_XS | 9.4293 | +2.08% | 0.0457 | 90.03% |
| Q3_K_S | 10.4822 | +13.51% | 0.1626 | 81.85% |

On Apple Silicon (llama.cpp, q8_0 KV cache): **Q4_K_M** had the best speed/quality trade-off across both standalone code-gen and agentic tasks. Q3_K_S held up reasonably well on quality at a smaller footprint, and IQ4_XS produced correct outputs but ran noticeably slower in the same harness. Your mileage may vary depending on your hardware and setup.

## Usage

### Text-only

```bash
llama-cli -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf -cnv
```

### Vision-language (multimodal)

The vision tower is loaded via `--mmproj`. Both `llama-mtmd-cli` (one-shot image+prompt) and `llama-server` (OpenAI-compatible HTTP server with image input) are supported.

**One-shot CLI:**

```bash
llama-mtmd-cli \
    -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf \
    --mmproj mmproj-REAP-26B-F16.gguf \
    --image path/to/photo.jpg \
    -p "Describe this image."
```

**Server (OpenAI-compatible `/v1/chat/completions` with `image_url`):**

```bash
llama-server \
    -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf \
    --mmproj mmproj-REAP-26B-F16.gguf \
    --port 8080
```

Then send a chat-completions request with an `image_url` content part (data URL or http URL) — the server routes it through the mmproj automatically. Capability is advertised as `multimodal` on the `/v1/models` endpoint when `--mmproj` is set.

#### Notes on the vision tower

- Converted from atbender's source vision encoder weights (BF16) to GGUF F16 via llama.cpp's `convert_hf_to_gguf.py --mmproj` pipeline.
- Validated end-to-end with `llama-mtmd-cli` on two test images (a real-world product photo and a music album scan); both produced accurate descriptions including readable on-image text, matching expectations from the bf16 reference.
- F16 was kept (not quantized further) because the vision tower is small (~860 MB) and quality-sensitive; the marginal disk savings of FP8/Q8 don't justify the risk of degrading image grounding.

## Acknowledgements

- Base model (REAP-pruned): [atbender/Qwen3.6-VL-REAP-26B-A3B](https://huggingface.co/atbender/Qwen3.6-VL-REAP-26B-A3B)
- Upstream architecture: [Qwen3-VL](https://huggingface.co/Qwen) (Alibaba)
- REAP pruning method: [Cerebras Research](https://github.com/CerebrasResearch/reap)
- Quantized with [llama.cpp](https://github.com/ggerganov/llama.cpp)

License inherited from the base model (Apache 2.0).