---
license: apache-2.0
base_model:
  - Qwen/Qwen3.6-35B-A3B
language:
  - en
tags:
  - GGUF
  - llama.cpp
  - qwen3.6
  - qwen
  - quantization
  - turboquant
  - tq3_4s
  - multimodal
  - Mixture of Experts
  - conversational
pipeline_tag: image-text-to-text
---

![thumbnail](thumbnail.png)

# Qwen3.6-35B-A3B-TQ3_4S

GGUF quantization of [`Qwen/Qwen3.6-35B-A3B`](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) using **TQ3_4S** with mixed-precision MoE compression — 2-bit experts, 4-bit attention.

## Files

| File | Description |
|------|-------------|
| `Qwen3.6-35B-A3B-TQ3_4S.gguf` | Main model (12.4 GiB, 3.07 BPW) |
| `mmproj-BF16.gguf` | Multimodal projector (BF16) |

## Quantization

MoE experts tolerate aggressive compression because only 8/256 are active per token. This quantization exploits that asymmetry:

| Component | Quant | Rationale |
|-----------|-------|-----------|
| Expert MLP gate/up | Q2_K | 98% of params, MoE-tolerant |
| Expert MLP down | Q3_K | Write-back sensitivity |
| Attention Q/K/V/O | TQ3_4S | WHT-protected |
| Embeddings + output | Q6_K | Quality anchor |

## Runtime Requirement

This model requires the public TurboQuant runtime fork:
* https://github.com/turbo-tan/llama.cpp-tq3

## Recommended Settings (16GB VRAM)

```bash
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```

With vision:

```bash
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja --no-mmproj-offload \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```

## Performance (RTX 5060 Ti 16GB)

| Metric | Value |
|--------|------:|
| PP512 | 1832 tok/s |
| TG128 | 107 tok/s |
| Size | 12.4 GiB |
| BPW | 3.07 |
| ngl | 99 (full GPU) |

Fits entirely in 16GB VRAM — no CPU offload needed.

## Quality

10/10 correct on standard QA benchmark (capital of France, 2+2, Python reverse string, gravity, WW2, primes, boiling point, Shakespeare, Jupiter, hello→Hola).

## Base Model

* [`Qwen/Qwen3.6-35B-A3B`](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
* Source: [`unsloth/Qwen3.6-35B-A3B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF) (Q8_0)

## License

Apache 2.0 — same as the base model.

## Tool Call Validation

Tested with `--jinja` on both `--reasoning off` and `--reasoning on --reasoning-budget 2048`:

| Test | reasoning off | reasoning on |
|------|:---:|:---:|
| Basic tool call trigger | ✅ | ✅ |
| Tool response → final answer (no loop) | ✅ | ✅ |
| Correct tool selection from multiple | ✅ | ✅ |
| No tool call for simple questions | ✅ | ✅ |
| Multi-step tool use | ✅ | ✅ |
| Nested quote escaping retry (no loop) | ✅ | ✅ |
| **Total** | **10/10** | **10/10** |

### Recommended settings for tool-use / agentic workflows

```bash
--jinja --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```

Avoid `--presence-penalty` above 0.5 for tool-use — high values diversify reasoning tokens but don't improve structured JSON output, and can cause repeated near-identical tool calls in agent loops.

If using `--reasoning on`, ensure your agent framework detects consecutive identical tool calls and breaks after 2-3 retries.

### Run tests yourself

```bash
chmod +x test_tool_calls.sh
./test_tool_calls.sh 8085
```