--- license: apache-2.0 base_model: atbender/Qwen3.6-VL-REAP-26B-A3B library_name: gguf pipeline_tag: image-text-to-text tags: - gguf - quantization - llama-cpp - qwen3.6 - qwen3-vl - vision-language - multimodal - mixture-of-experts - reap - pruned --- # Qwen3.6-VL-REAP-26B-A3B — GGUF GGUF quantizations of [atbender/Qwen3.6-VL-REAP-26B-A3B](https://huggingface.co/atbender/Qwen3.6-VL-REAP-26B-A3B), a REAP-pruned variant of Qwen3.6-VL. **Both the language model (text quants) and the vision tower (mmproj) are included** - drop the mmproj alongside any text quant for full multimodal (image + text) inference. ## Files | File | Quant | Size | | --- | --- | ---: | | `Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf` | Q4_K_M | ~15 GB | | `Qwen3.6-VL-REAP-26B-A3B-text-IQ4_XS.gguf` | IQ4_XS | ~14 GB | | `Qwen3.6-VL-REAP-26B-A3B-text-Q3_K_S.gguf` | Q3_K_S | ~11 GB | | `mmproj-REAP-26B-F16.gguf` | F16 (vision tower) | ~860 MB | ## Quality vs bf16 (wikitext-2-raw, llama.cpp perplexity) Each text quant was scored against the bf16 reference using `llama-perplexity` on the wikitext-2-raw test split (580 chunks, n_ctx=512, ~297k tokens). Bench run on a single RTX PRO 6000 (Blackwell, 96 GB). | Quant | PPL | ΔPPL vs bf16 | Mean KLD | Top-1 token agree | | --- | ---: | ---: | ---: | ---: | | **bf16** (reference) | **9.2369** | — | 0 | 100% | | Q4_K_M | 9.3858 | **+1.62%** | **0.0449** | **90.41%** | | IQ4_XS | 9.4293 | +2.08% | 0.0457 | 90.03% | | Q3_K_S | 10.4822 | +13.51% | 0.1626 | 81.85% | On Apple Silicon (llama.cpp, q8_0 KV cache): **Q4_K_M** had the best speed/quality trade-off across both standalone code-gen and agentic tasks. Q3_K_S held up reasonably well on quality at a smaller footprint, and IQ4_XS produced correct outputs but ran noticeably slower in the same harness. Your mileage may vary depending on your hardware and setup. ## Usage ### Text-only ```bash llama-cli -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf -cnv ``` ### Vision-language (multimodal) The vision tower is loaded via `--mmproj`. Both `llama-mtmd-cli` (one-shot image+prompt) and `llama-server` (OpenAI-compatible HTTP server with image input) are supported. **One-shot CLI:** ```bash llama-mtmd-cli \ -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf \ --mmproj mmproj-REAP-26B-F16.gguf \ --image path/to/photo.jpg \ -p "Describe this image." ``` **Server (OpenAI-compatible `/v1/chat/completions` with `image_url`):** ```bash llama-server \ -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf \ --mmproj mmproj-REAP-26B-F16.gguf \ --port 8080 ``` Then send a chat-completions request with an `image_url` content part (data URL or http URL) — the server routes it through the mmproj automatically. Capability is advertised as `multimodal` on the `/v1/models` endpoint when `--mmproj` is set. #### Notes on the vision tower - Converted from atbender's source vision encoder weights (BF16) to GGUF F16 via llama.cpp's `convert_hf_to_gguf.py --mmproj` pipeline. - Validated end-to-end with `llama-mtmd-cli` on two test images (a real-world product photo and a music album scan); both produced accurate descriptions including readable on-image text, matching expectations from the bf16 reference. - F16 was kept (not quantized further) because the vision tower is small (~860 MB) and quality-sensitive; the marginal disk savings of FP8/Q8 don't justify the risk of degrading image grounding. ## Acknowledgements - Base model (REAP-pruned): [atbender/Qwen3.6-VL-REAP-26B-A3B](https://huggingface.co/atbender/Qwen3.6-VL-REAP-26B-A3B) - Upstream architecture: [Qwen3-VL](https://huggingface.co/Qwen) (Alibaba) - REAP pruning method: [Cerebras Research](https://github.com/CerebrasResearch/reap) - Quantized with [llama.cpp](https://github.com/ggerganov/llama.cpp) License inherited from the base model (Apache 2.0).