How to use from Pi
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf JZC973/Qwen3.6-35B-REAP-MTP-UD-GGUF-Collection:
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "JZC973/Qwen3.6-35B-REAP-MTP-UD-GGUF-Collection:"
        }
      ]
    }
  }
}
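
If ~/.pi/agent/models.json already lists other providers, pasting the snippet above over the file would drop them. The following is a minimal Python sketch of merging the entry instead. The helper name add_llama_cpp_provider is hypothetical and not part of Pi, and the field names are simply copied from the JSON above. The model id is passed through exactly as written in this card, including the trailing colon, because the tag after it is not given.

```python
import copy
import json
from typing import Any, Dict

# Model id exactly as it appears in this card; the tag after the colon is not given.
MODEL_ID = "JZC973/Qwen3.6-35B-REAP-MTP-UD-GGUF-Collection:"


def add_llama_cpp_provider(
    config: Dict[str, Any],
    model_id: str,
    base_url: str = "http://localhost:8080/v1",
) -> Dict[str, Any]:
    """Return a copy of `config` with a "llama-cpp" provider entry added.

    Hypothetical helper: other providers already present are left untouched,
    and the input dictionary is not modified.
    """
    merged = copy.deepcopy(config)
    providers = merged.setdefault("providers", {})
    providers["llama-cpp"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    return merged


# Example: merge into a config that already has another (made-up) provider.
existing = {"providers": {"example-provider": {"baseUrl": "http://example.invalid/v1"}}}
print(json.dumps(add_llama_cpp_provider(existing, MODEL_ID), indent=2))
```

To apply it, load the existing file with json.load, pass the result through the helper, and write it back with json.dump.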
Run Pi
# Start Pi in your project directory:
pi

Qwen 3.6 REAP-Pruned MTP GGUF Collection

This archive contains a collection of pruned models derived from Unsloth's Qwen 3.6 35B MTP GGUF. The models were pruned with the REAP (Router-weighted Expert Activation Pruning) strategy while keeping the Multi-Token Prediction (MTP) layer (Layer 40) functional.

Models Included

1. ATBender Configuration (192 Experts)

Pruned based on the atbender/Qwen3.6-VL-REAP-26B-A3B strategy.

  • Qwen3.6-35B-A3B-UD-IQ3_S-REAP.gguf
  • Qwen3.6-35B-A3B-UD-IQ3_XXS-REAP.gguf
  • Qwen3.6-35B-A3B-UD-Q3_K_M-REAP.gguf
  • Qwen3.6-35B-A3B-UD-Q3_K_XL-REAP.gguf

2. RangerX Configuration (180 Experts, Ratio 0.3)

Pruned based on the RangerX/Qwen3.6-35B-REAP-Pruned-ratio-0.3 strategy (reverse-engineered from router weights).

  • Qwen3.6-35B-A3B-UD-IQ3_S-REAP-RangerX.gguf
  • Qwen3.6-35B-A3B-UD-IQ3_XXS-REAP-RangerX.gguf
  • Qwen3.6-35B-A3B-UD-Q3_K_M-REAP-RangerX.gguf
  • Qwen3.6-35B-A3B-UD-Q3_K_XL-REAP-RangerX.gguf

Creation Process

These models were created using a custom Python script that:

  1. Identified Kept Experts: For the ATBender set, indices were pulled from reap_metadata.json. For the RangerX set, indices were reverse-engineered by comparing the pruned router gate weights against the original unpruned gates via cosine similarity.
  2. Surgical Slicing: The script directly sliced the GGUF tensors along the expert dimension (dim0 in the raw memory layout) without dequantizing the individual blocks.
  3. MTP Preservation: All MTP-specific tensors (Layer 40) were preserved. For Layer 40 tensors that have an expert dimension, the same expert selection as Layer 39 was applied to keep the architecture consistent.
  4. Metadata Updates: The qwen35moe.expert_count key in the GGUF header was updated to the new expert count so that llama.cpp can load the pruned files.
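
The expert matching and slicing steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script that produced these files: the function names are hypothetical, router gates are given as plain lists of floats (one row per expert), and each expert tensor is treated as a raw byte buffer whose outermost dimension is the expert index, so every expert occupies one contiguous, equally sized byte range. Under that assumption whole quantized blocks are copied as-is and never dequantized.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_kept_experts(
    original_gate: Sequence[Sequence[float]],
    pruned_gate: Sequence[Sequence[float]],
) -> List[int]:
    """For each pruned router row, return the index of the most similar original row."""
    kept = []
    for row in pruned_gate:
        sims = [cosine(row, orig) for orig in original_gate]
        kept.append(max(range(len(sims)), key=sims.__getitem__))
    return kept


def slice_experts_raw(raw: bytes, n_experts: int, kept: Sequence[int]) -> bytes:
    """Keep only the chosen experts from a raw tensor buffer.

    Assumes the expert index is the outermost dimension, so the buffer splits
    into `n_experts` contiguous chunks of equal size regardless of quant type.
    """
    if n_experts <= 0 or len(raw) % n_experts:
        raise ValueError("buffer size is not a multiple of the expert count")
    stride = len(raw) // n_experts  # bytes per expert
    for i in kept:
        if not 0 <= i < n_experts:
            raise IndexError(f"expert index {i} out of range")
    return b"".join(raw[i * stride:(i + 1) * stride] for i in kept)


# Toy data: 4 original experts; the pruned gate holds scaled copies of rows 1 and 3.
original_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
pruned_gate = [[0.0, 2.0], [-2.0, 1.0]]
kept = match_kept_experts(original_gate, pruned_gate)

raw = bytes(range(8))  # 4 experts x 2 bytes each
print(kept, list(slice_experts_raw(raw, 4, kept)))
```

Cosine similarity ignores scale, which is why the toy pruned rows still match their originals after being multiplied by 2. On real tensors the same idea would more likely be vectorized with NumPy, and a check that no original index is matched twice would be a sensible safeguard.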

Usage

To run with MTP support, use a recent build of llama.cpp (May 2026 or later) with the following flags: --spec-type draft-mtp --spec-draft-n-max 3
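
As a small illustration, the Python sketch below assembles a llama-server command line that includes the MTP options named above. The helper mtp_server_command is hypothetical. It assumes one of the listed GGUF files has already been downloaded to the current directory and is loaded with llama.cpp's -m option. The two speculative-decoding options are taken verbatim from this card.

```python
import shlex
from typing import List


def mtp_server_command(model_path: str, draft_n_max: int = 3) -> List[str]:
    """Build a llama-server argument list with the MTP options from this card."""
    if draft_n_max < 1:
        raise ValueError("draft_n_max must be at least 1")
    return [
        "llama-server",
        "-m", model_path,              # local GGUF file (assumed already downloaded)
        "--spec-type", "draft-mtp",    # as given in the Usage section
        "--spec-draft-n-max", str(draft_n_max),
    ]


cmd = mtp_server_command("Qwen3.6-35B-A3B-UD-Q3_K_M-REAP.gguf")
print(shlex.join(cmd))  # copy-pasteable shell command
```

The resulting list can be passed directly to subprocess.run(cmd) to launch the server from a script.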


Created by Gemini CLI on May 21, 2026.

Format: GGUF
Model size: 26B params
Architecture: qwen35moe
Quantization: 3-bit
