Qwen3.5-0.8B-MTP-GGUF

Qwen3.5-0.8B, from Alibaba's Qwen team, is the smallest model in the Qwen3.5 family: an ultra-compact, 0.8B-parameter dense multimodal language model for unified text and image understanding at extreme efficiency. It uses a hybrid architecture that interleaves Gated DeltaNet (linear attention) layers with gated full-attention layers. It has 24 layers, a hidden dimension of 1024, a 248K-token vocabulary spanning 201 languages, multi-token prediction, and a 262K-token native context window (extensible to 1M+ tokens via YaRN).

Designed under the "More Size, Less Waste" philosophy, it scores 10.5 on the Artificial Analysis Intelligence Index. That ranks #383 overall but is exceptional for a sub-1B model. It also responds quickly, with a reported time to first token of 0.00 s (#12 globally). Its weights need about 1.6 GB of memory in BF16, or about 0.5 GB with 4-bit quantization, which makes it well suited to Raspberry Pi boards, mobile phones, and embedded IoT devices.

The model is Apache 2.0-licensed and supported by Ollama, vLLM, and llama.cpp. It handles lightweight OCR, document parsing, multilingual chatbots, visual QA, and basic coding tasks. This makes it the most accessible entry point for on-device multimodal AI without cloud dependencies.

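Multi-Token Prediction (MTP) GGUF files keep the model's built-in MTP layers alongside the main weights. A runtime that supports them can use these layers for speculative decoding to speed up local inference. Traditional speculative decoding requires a separate, smaller "draft" model. With MTP, the draft comes from additional prediction head(s) inside the main model architecture. These heads cheaply propose several future tokens, and the main model then verifies all of them in a single forward pass, keeping the ones it agrees with. Because the main model checks every proposed token, speculation does not change the model's output, only how fast it is produced. The speedup depends on how often the drafts are accepted.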
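The draft-and-verify loop described above can be sketched in Python. This is a minimal toy under stated assumptions, not the llama.cpp implementation. `target_model` and `draft_model` are hypothetical stand-ins, with the draft playing the role of the MTP head. Each "model" simply returns a greedy next token. Verification calls the target once per position rather than in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A toy "model": maps the token sequence so far to its greedy next token.
# (Assumption: stands in for argmax over a real model's logits.)
NextToken = Callable[[List[int]], int]


def target_model(seq: List[int]) -> int:
    # Hypothetical main model: next token = sum of the last two tokens mod 10.
    return (seq[-1] + seq[-2]) % 10


def draft_model(seq: List[int]) -> int:
    # Hypothetical draft (the MTP head's role): usually agrees with the target,
    # but guesses wrong whenever the sequence length is a multiple of 5.
    guess = (seq[-1] + seq[-2]) % 10
    return (guess + 1) % 10 if len(seq) % 5 == 0 else guess


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    # Baseline: one sequential target step per generated token.
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding; returns (new tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        rounds += 1
        # 1. Draft: propose k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. Verify: a real runtime scores all k positions in ONE forward pass of
        #    the main model; this toy calls target() per position for clarity.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # the target's token replaces the first miss
                break
            accepted.append(tok)
        else:
            # Every draft matched, so the same pass also yields one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new], rounds


if __name__ == "__main__":
    prompt = [1, 1]
    tokens, rounds = speculative_decode(target_model, draft_model, prompt, max_new=12, k=3)
    assert tokens == greedy_decode(target_model, prompt, max_new=12)
    print(tokens, "in", rounds, "rounds")  # → [2, 3, 5, 8, 3, 1, 4, 5, 9, 4, 3, 7] in 4 rounds
```

The speculative loop produces exactly the same 12 tokens as plain greedy decoding. It needs only 4 verification rounds of the target, compared with 12 sequential steps. A draft that disagrees more often would need more rounds, which is why real-world speedups depend on the acceptance rate.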

Model Files

| File Name | Quant Type | File Size |
|---|---|---|
| Qwen3.5-0.8B.Q2_K.gguf | Q2_K | 430 MB |
| Qwen3.5-0.8B.Q3_K_S.gguf | Q3_K_S | 444 MB |
| Qwen3.5-0.8B.Q3_K_M.gguf | Q3_K_M | 476 MB |
| Qwen3.5-0.8B.Q3_K_L.gguf | Q3_K_L | 502 MB |
| Qwen3.5-0.8B.Q4_0.gguf | Q4_0 | 513 MB |
| Qwen3.5-0.8B.Q4_K_S.gguf | Q4_K_S | 517 MB |
| Qwen3.5-0.8B.Q4_K_M.gguf | Q4_K_M | 542 MB |
| Qwen3.5-0.8B.Q5_0.gguf | Q5_0 | 578 MB |
| Qwen3.5-0.8B.Q5_K_S.gguf | Q5_K_S | 578 MB |
| Qwen3.5-0.8B.Q5_K_M.gguf | Q5_K_M | 593 MB |
| Qwen3.5-0.8B.Q6_K.gguf | Q6_K | 647 MB |
| Qwen3.5-0.8B.Q8_0.gguf | Q8_0 | 834 MB |
| Qwen3.5-0.8B.F16.gguf | F16 | 1.56 GB |
| Qwen3.5-0.8B.BF16.gguf | BF16 | 1.56 GB |
| Qwen3.5-0.8B.mmproj-q8_0.gguf | mmproj-q8_0 | 116 MB |
| Qwen3.5-0.8B.mmproj-f16.gguf | mmproj-f16 | 207 MB |
| Qwen3.5-0.8B.mmproj-bf16.gguf | mmproj-bf16 | 207 MB |

The mmproj files contain the vision encoder and projector. They are needed only for image input and are loaded together with one of the language-model files.
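The file sizes in the table can be converted into a rough bits-per-weight figure. This makes the quant types easier to compare and ties back to the ~1.6 GB (BF16) and ~0.5 GB (4-bit) memory figures quoted for the model. The minimal Python sketch below rests on two assumptions. First, the listed sizes are decimal megabytes. Second, the BF16 file spends 16 bits on essentially every weight.

```python
from typing import Dict

# Language-model file sizes from the Model Files table, in MB.
# Assumption: sizes use decimal units (1 GB = 1000 MB), and the BF16 file
# stores essentially every weight in 16 bits (metadata is negligible).
QUANT_SIZES_MB: Dict[str, float] = {
    "Q2_K": 430, "Q3_K_S": 444, "Q3_K_M": 476, "Q3_K_L": 502,
    "Q4_0": 513, "Q4_K_S": 517, "Q4_K_M": 542,
    "Q5_0": 578, "Q5_K_S": 578, "Q5_K_M": 593,
    "Q6_K": 647, "Q8_0": 834, "BF16": 1560,
}


def implied_params(bf16_size_mb: float) -> float:
    # 16 bits = 2 bytes per weight in the BF16 reference file.
    return bf16_size_mb * 1e6 / 2


def effective_bpw(size_mb: float, bf16_size_mb: float = QUANT_SIZES_MB["BF16"]) -> float:
    # Average bits per weight, measured relative to the 16-bit reference file.
    return 16.0 * size_mb / bf16_size_mb


if __name__ == "__main__":
    n = implied_params(QUANT_SIZES_MB["BF16"])
    print(f"implied weight count: {n / 1e9:.2f}B")  # → implied weight count: 0.78B
    for name, size in sorted(QUANT_SIZES_MB.items(), key=lambda kv: kv[1]):
        print(f"{name:7s} {size:5.0f} MB  ~{effective_bpw(size):.2f} bits/weight")
```

Under these assumptions, the BF16 file implies roughly 0.78B weights in the language model. The vision encoder is stored separately in the mmproj files. Q8_0 comes out at about 8.55 bits per weight, close to its nominal 8.5.

The lower quants land well above their nominal widths. For example, Q4_0 comes out at about 5.26 bits per weight, against a nominal 4.5. This is likely because some tensors, such as the embedding matrix for the 248K vocabulary (roughly a third of the weights here), are kept at higher precision.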

Quants Usage

(File size does not necessarily track quality. IQ-quants are often preferable to similarly sized non-IQ quants, although this repository does not include any.)

[Figure: graph by ikawrakow comparing some lower-quality quant types (lower is better); image not included]

Format: GGUF
Model size: 0.8B params
Architecture: qwen35

Repository: prithivMLmods/Qwen3.5-0.8B-MTP-GGUF (a GGUF quantization of Qwen3.5-0.8B)