---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-35B-A3B
language:
- en
tags:
- GGUF
- llama.cpp
- qwen3.6
- qwen
- quantization
- turboquant
- tq3_4s
- multimodal
- Mixture of Experts
- conversational
pipeline_tag: image-text-to-text
---

![thumbnail](thumbnail.png)

# Qwen3.6-35B-A3B-TQ3_4S

GGUF quantization of [`Qwen/Qwen3.6-35B-A3B`](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) using **TQ3_4S** with mixed-precision MoE compression: 2-bit experts, 4-bit attention.

## Files

| File | Description |
|------|-------------|
| `Qwen3.6-35B-A3B-TQ3_4S.gguf` | Main model (12.4 GiB, 3.07 BPW) |
| `mmproj-BF16.gguf` | Multimodal projector (BF16) |

## Quantization

MoE experts tolerate aggressive compression because only 8 of 256 are active per token. This quantization exploits that asymmetry:

| Component | Quant | Rationale |
|-----------|-------|-----------|
| Expert MLP gate/up | Q2_K | 98% of params, MoE-tolerant |
| Expert MLP down | Q3_K | Write-back sensitivity |
| Attention Q/K/V/O | TQ3_4S | WHT-protected |
| Embeddings + output | Q6_K | Quality anchor |

## Runtime Requirement

This model requires the public TurboQuant runtime fork:

* https://github.com/turbo-tan/llama.cpp-tq3

## Recommended Settings (16GB VRAM)

```bash
# Offload all layers to the GPU (-ngl 99), 4096-token context, one parallel slot,
# quantized K/V cache (-ctk/-ctv), flash attention on.
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```

With vision:

```bash
# Same settings plus the multimodal projector; --no-mmproj-offload keeps the projector off the GPU.
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja --no-mmproj-offload \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```

## Performance (RTX 5060 Ti 16GB)

| Metric | Value |
|--------|------:|
| PP512 (prompt processing) | 1832 tok/s |
| TG128 (token generation) | 107 tok/s |
| Size | 12.4 GiB |
| BPW | 3.07 |
| ngl | 99 (full GPU) |

Fits entirely in 16GB VRAM; no CPU offload needed.

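
As a rough consistency check on the size and BPW figures above, the file size in bits divided by the bits per weight gives the implied parameter count. The following is a minimal sketch, assuming `awk` is available; `estimate_params_billion` is a hypothetical helper name, and the result is only approximate because the GGUF file also holds metadata.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate the parameter count (in billions) implied by
# a file size in GiB and an average bits-per-weight figure.
estimate_params_billion() {
  local size_gib="$1" bpw="$2"
  # GiB -> bytes -> bits, then divide by bits per weight and scale to billions.
  awk -v s="$size_gib" -v b="$bpw" \
    'BEGIN { printf "%.1f\n", s * 1024 * 1024 * 1024 * 8 / b / 1e9 }'
}

# 12.4 GiB at 3.07 BPW
estimate_params_billion 12.4 3.07   # prints 34.7
```

The estimate of about 34.7 billion parameters is in line with the 35B in the model name.
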
## Quality

10/10 correct on an informal 10-question QA check (capital of France, 2+2, Python reverse string, gravity, WW2, primes, boiling point, Shakespeare, Jupiter, hello→Hola).

## Base Model

* [`Qwen/Qwen3.6-35B-A3B`](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
* Source: [`unsloth/Qwen3.6-35B-A3B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF) (Q8_0)

## License

Apache 2.0, same as the base model.

## Tool Call Validation

Tested with `--jinja` under both `--reasoning off` and `--reasoning on --reasoning-budget 2048`:

| Test | reasoning off | reasoning on |
|------|:---:|:---:|
| Basic tool call trigger | ✅ | ✅ |
| Tool response → final answer (no loop) | ✅ | ✅ |
| Correct tool selection from multiple | ✅ | ✅ |
| No tool call for simple questions | ✅ | ✅ |
| Multi-step tool use | ✅ | ✅ |
| Nested quote escaping retry (no loop) | ✅ | ✅ |
| **Total** | **10/10** | **10/10** |

### Recommended settings for tool-use / agentic workflows

```bash
--jinja --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```

Avoid `--presence-penalty` above 0.5 for tool use. High values diversify reasoning tokens but don't improve structured JSON output, and they can cause repeated near-identical tool calls in agent loops.

If using `--reasoning on`, ensure your agent framework detects consecutive identical tool calls and breaks after 2-3 retries.

### Run tests yourself

```bash
chmod +x test_tool_calls.sh
./test_tool_calls.sh 8085
```
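
The earlier advice about breaking out of consecutive identical tool calls could be sketched as a small shell guard. This is a minimal sketch under stated assumptions: `detect_tool_loop` is a hypothetical helper (not part of llama.cpp or `test_tool_calls.sh`), and it assumes the agent emits one serialized tool call per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of llama.cpp or test_tool_calls.sh.
# Reads one serialized tool call per line on stdin and returns 1 as soon as
# the same line appears max_repeats times in a row; returns 0 otherwise.
detect_tool_loop() {
  local max_repeats="${1:-3}"
  local prev="" count=0 line
  while IFS= read -r line; do
    if [[ $count -gt 0 && "$line" == "$prev" ]]; then
      count=$((count + 1))      # same call as the previous one
    else
      prev="$line"              # a different call resets the streak
      count=1
    fi
    if (( count >= max_repeats )); then
      echo "tool-call loop detected after $count identical calls" >&2
      return 1
    fi
  done
  return 0
}
```

An agent wrapper could pipe its tool-call log through `detect_tool_loop 3` and abort the run on a non-zero exit status. Comparing whole lines only catches exact repeats; near-identical calls would need normalization first.
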