Qwen3.5-35B-A3B Chimere v3 -- RAMP GGUF
Chimere v3: Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, optimized for instruction following and reasoning.
RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB VRAM, ~80 tok/s on RTX 5060 Ti.
Looking for v1 (best code + tools)? See Chimere v1 GGUF.
Compatible runtimes
This GGUF can be loaded by any runtime that supports the Qwen3.5-35B-A3B (qwen35moe) architecture. The reference runtime — and the one that exercises all chimere-specific features (Engram n-gram bias, multi-agent context switching, the C++ fast sampler with DRY + min-p, K-cache Hadamard rotation, fused MoE up/gate) — is chimere-server.
| Runtime | Engram | Multi-agent | DRY sampler | K-cache Hadamard | Notes |
|---|---|---|---|---|---|
| chimere-server (Rust, official) | yes | yes | yes (C++ fast path) | yes | Production target. Also runs Mamba-2 / Nemotron-H MoE through the same backend (PR ikawrakow/ik_llama.cpp#1593). |
| ik_llama.cpp llama-server | no | no | optional | optional | Same backend that chimere-server links against, just without the Rust HTTP/sampling layer. |
| llama.cpp stock llama-server | no | no | no | no | Works, but slower on Qwen3.5 MoE on our hardware (no iqk matmul, no fused MoE up/gate). |
Benchmark Results
v3 strengths: instructions and reasoning
| Benchmark | v3 RAMP (this repo) | v1 RAMP | Base Qwen3.5-35B-A3B | Notes |
|---|---|---|---|---|
| IFEval (15 instruction tests) | 100% | 67% | ~91.9% | +33 pts vs v1 |
| Edge cases (15 adversarial tests) | 100% | 87% | -- | Perfect prompt injection resistance |
| GSM8K CoT 8-shot (1,319 qs) | 84.0% | 52.2% | -- | +32 pts vs v1 |
| HumanEval (30 problems, executed) | 83% | 97% | -- | v1 better here |
| BFCL tool-calling (20 questions) | 75% | 90% | 67.3% | v1 better here |
| Speed (RTX 5060 Ti 16 GB, chimere-server) | ~80 tok/s | ~80 tok/s | -- | NCMOE=3, ctx 64K |
Qualitative agentic tests
| Scenario | v3 | v1 | Max score |
|---|---|---|---|
| Cybersecurity incident response (multi-tool chain) | 4 | 4 | 10 |
| ML pipeline architecture (RAG, 10K users, $50K budget) | 8 | 8 | 10 |
| Rust MoE runtime optimization (async prefetch, CUDA) | 8 | 7 | 10 |
| Total | 20 | 19 | 30 |
Honest assessment
- Strengths: 100% IFEval, 100% adversarial edge cases, 84% GSM8K; the stronger of the two versions on reasoning
- Weaknesses: code generation is weaker than v1 (83% vs 97% HumanEval) and tool-calling is lower (75% vs 90% BFCL)
- Why: the v3 dataset added IFEval-strict, OPSDC-compressed reasoning, and instruction-following samples on top of the v1 base. Recommended for general agentic use.
Which version to use?
| Use case | Recommended | Why |
|---|---|---|
| Instruction following, formatting | v3 (this repo) | 100% IFEval, 100% edge cases |
| Math reasoning | v3 (this repo) | 84% GSM8K (vs 52% v1) |
| Prompt injection resistance | v3 (this repo) | 100% adversarial edge cases |
| Code generation, debugging | v1 | 97% HumanEval |
| Tool-calling, function calling | v1 | 90% BFCL |
| Re-quantization or fine-tuning | BF16 weights | Full precision |
Best of both worlds: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See Chimere ODO.
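The routing idea can be illustrated with a toy sketch. The keyword list and the `route_adapter` name are hypothetical and chosen only for this example; the actual intent classifier lives in Chimere ODO and is not described in this card.

```python
# Toy intent router. This is NOT the Chimere ODO classifier, only an illustration.
# The keyword list and function name are assumptions made for this sketch.
CODE_TOOL_HINTS = ("code", "function", "debug", "compile", "tool", "api call")


def route_adapter(query: str) -> str:
    """Return "v1" for code/tool-style queries and "v3" otherwise."""
    text = query.lower()
    if any(hint in text for hint in CODE_TOOL_HINTS):
        return "v1"  # v1 scores higher on HumanEval and BFCL in the tables above
    return "v3"      # v3 scores higher on IFEval and GSM8K in the tables above


print(route_adapter("Debug this Rust function"))
print(route_adapter("Summarize this memo in three bullet points"))
```

A real router would likely use a trained classifier rather than substring matching; the sketch only shows where the v1/v3 decision sits.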
Quick start (chimere-server, recommended)
# 1. Backend (one-time): build the ik_llama.cpp fork with sm_120 CUDA + Mamba-2 backport
git clone https://github.com/AIdevsmartdata/ik_llama.cpp.git ~/ik_llama.cpp
cd ~/ik_llama.cpp
git checkout mamba2-nemotron-h-backport
cmake -B build_sm120 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_NATIVE=OFF
cmake --build build_sm120 -j
# 2. Server
git clone https://github.com/AIdevsmartdata/chimere.git ~/chimere
cd ~/chimere/chimere-server
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
cargo build --release --features server --bin chimere-server
# 3. Model + tokenizer
mkdir -p ~/models && cd ~/models
hf download Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF chimere-v3-ramp.gguf --local-dir .
hf download Qwen/Qwen3.5-35B-A3B tokenizer.json --local-dir tokenizers/qwen35
# 4. Run (production env vars)
CHIMERE_MODEL=$PWD/chimere-v3-ramp.gguf \
CHIMERE_TOKENIZER=$PWD/tokenizers/qwen35/tokenizer.json \
CHIMERE_LLAMA_BACKEND=1 \
CHIMERE_NCMOE=3 \
CHIMERE_KV_MAX_SEQ=65536 \
CHIMERE_PORT=8081 \
CHIMERE_FORCE_QWEN35=1 \
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
~/chimere/chimere-server/target/release/chimere-server
# 5. Hello world
curl -s http://localhost:8081/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
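The same hello-world request can be sent from Python with only the standard library. This is a minimal sketch: it assumes the server from step 4 is listening on port 8081 and that the reply follows the usual OpenAI chat-completion shape. The helper names `build_chat_request` and `chat` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8081/v1"  # matches CHIMERE_PORT=8081 in step 4


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, max_tokens: int = 64) -> str:
    """Send the request; needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt, max_tokens)) as resp:
        reply = json.load(resp)
    # Assumes the standard OpenAI response layout.
    return reply["choices"][0]["message"]["content"]


# Usage (requires the server to be up):
# print(chat("Hello"))
```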
Engram (optional, prod-only)
Chimere ships an n-gram logit bias overlay loaded from binary .engr tables. To enable it, set:
CHIMERE_ENGRAM_DIR=/path/to/engram_tables # directory of *.engr files
CHIMERE_ENGRAM_ALPHA=0.1 # logit bias strength
The engram tables are tokenizer-specific (Qwen3.5 vocab) and used as a per-domain overlay (kine, code, cyber, general). They are intended as a domain-knowledge injector, not a measured quality booster — see the chimere repo README for the honest status of the path.
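To make "n-gram logit bias" concrete, here is a generic sketch of the technique. It is not chimere's implementation: the in-memory table, the bigram context length, and the relative-frequency scoring are all assumptions for illustration, and the real `.engr` binary layout is not described in this card.

```python
ALPHA = 0.1  # same value as CHIMERE_ENGRAM_ALPHA in the example above

# Hypothetical table: (two previous token ids) -> {next token id: count}.
NGRAM_TABLE = {(7, 8): {9: 30, 4: 10}}


def apply_ngram_bias(logits, context, table=NGRAM_TABLE, alpha=ALPHA):
    """Return a copy of logits with alpha * relative-frequency bonuses added."""
    biased = list(logits)
    counts = table.get(tuple(context[-2:]))
    if not counts:
        return biased  # no matching n-gram: logits are unchanged
    total = sum(counts.values())
    for token_id, count in counts.items():
        biased[token_id] += alpha * (count / total)
    return biased


logits = [0.0] * 10
print(apply_ngram_bias(logits, context=[1, 7, 8]))
```

With a small alpha the overlay nudges the distribution toward domain vocabulary rather than overriding the model, which fits the description of Engram as an overlay rather than a quality booster.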
Quick start (generic GGUF runtimes)
If you do not need the chimere stack, the GGUF works with any Qwen3.5-compatible runtime:
# llama.cpp / llama-server
llama-server \
-m chimere-v3-ramp.gguf \
-ngl 99 --n-cpu-moe 4 -c 32768 \
--flash-attn on --jinja --port 8081
# For 16 GB VRAM (RTX 5060 Ti / RTX 4080):
# Add KV cache quantization to save VRAM:
# -ctk q8_0 -ctv q4_0
Recommended sampling parameters
| Mode | temp | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (default) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
| No-think | 0.7 | 0.8 | 20 | 0.0 |
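The table can be carried into client code as a small preset map. This is a sketch under assumptions: the preset keys and the `with_sampling` helper are invented for the example, and whether a given server honors `top_k` in the request body (it is not part of the core OpenAI schema) should be checked against that server's documentation.

```python
# Values copied from the table above; the preset names are hypothetical.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_code": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "no_think": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0},
}


def with_sampling(payload: dict, mode: str = "thinking") -> dict:
    """Return a new request payload with the preset's sampling fields merged in."""
    return {**payload, **SAMPLING_PRESETS[mode]}


request = with_sampling(
    {"messages": [{"role": "user", "content": "Write a binary search in Rust."}]},
    mode="thinking_code",
)
print(request["temperature"], request["top_p"])
```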
Backend
The official chimere-server runtime links against a customized ik_llama.cpp fork (branch mamba2-nemotron-h-backport, head of upstream PR ikawrakow/ik_llama.cpp#1593).
Highlights of the chimere-specific layer on top of ik_llama:
- Custom C++ fast sampler exporting `sample_token_fast`, `set_logit_bias`, `set_engram_bias`, `clear_engram_bias` and `take_packed_logprobs`: avoids a ~993 KB logits copy per token, packs OpenAI-format top-5 logprobs.
- K-cache Hadamard rotation, fused MoE up/gate, grouped expert routing, all enabled by default via `cparams`.
- Multi-agent KV / SSM state save & restore via `llama_state_seq_*`, keyed on the OpenAI `user` field. Up to `CHIMERE_MAX_AGENTS` (default 4) concurrent personas with their own conversation state.
- An OpenAI-compatible HTTP layer in Rust (axum 0.8), supporting non-streaming and SSE streaming, tool calls, `<think>` reasoning extraction and `chat_template_kwargs.enable_thinking`.
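From the client side, the multi-agent and thinking controls come down to two request fields named above: `user` and `chat_template_kwargs.enable_thinking`. The sketch below only builds the JSON body; the agent ids and the `agent_payload` helper are hypothetical.

```python
def agent_payload(agent_id: str, prompt: str, thinking: bool = True) -> dict:
    """Build a chat-completions body bound to one agent persona."""
    return {
        "user": agent_id,  # per the list above, the server keys KV / SSM state on this field
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


# Two personas with separate conversation state (ids are made up):
planner = agent_payload("planner", "Outline the migration steps.")
coder = agent_payload("coder", "Write the first migration script.", thinking=False)
print(planner["user"], coder["chat_template_kwargs"])
```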
Multi-architecture support
The chimere-server runtime is no longer Qwen-only. As of Step 7 (April 2026), it dispatches between two code paths based on the GGUF's general.architecture metadata:
- Qwen3.5-35B-A3B (`qwen35moe`): full production stack with MTP, MRoPE, Engram, agent scheduler, custom Candle / cudarc / libllama paths. This GGUF.
- Mamba-2 / Nemotron-H MoE / Mamba-1 / Mamba-2 hybrids: libllama-only path via `GenericModel`. No MTP, no Engram, single-agent only at Step 7. Validated end-to-end on `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` (Q4_0 and UD-IQ3_XXS) at ~45 tok/s on RTX 5060 Ti, NCMOE=30, ctx 2048, via the bundled `test-nemotron` smoke binary.
Models that should run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, state-spaces/mamba2-*, mistralai/Mamba-Codestral-7B-v0.1, AI21-Jamba-Reasoning-3B.
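The dispatch rule described in this section can be summarized in a few lines. This is an illustrative sketch, not the Rust implementation: the returned labels are invented, the architecture string is passed in directly instead of being read from the GGUF header, and rejection of unsupported architectures is not modeled.

```python
def select_code_path(architecture: str) -> str:
    """Pick a code path from the GGUF general.architecture value."""
    if architecture == "qwen35moe":
        # Full stack: MTP, MRoPE, Engram, agent scheduler.
        return "qwen35-production"
    # Everything else goes through the libllama-only GenericModel path.
    return "generic-libllama"


print(select_code_path("qwen35moe"))
print(select_code_path("some-other-arch"))  # placeholder architecture name
```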
RAMP Quantization Details
Custom per-tensor quality overrides -- critical paths get higher precision. Overall: ~3.78 BPW.
| Tensor | Quant | BPW | Rationale |
|---|---|---|---|
| attn_v (value) | Q8_0 | 8.0 | Most critical -- errors cause hallucinations |
| ssm_alpha, ssm_d | Q8_0 | 8.0 | GDN recurrent params, tiny but hypersensitive |
| attn_k (key) | Q6_K | 6.5 | Important for attention routing |
| ssm_dt | Q6_K | 6.5 | GDN timestep |
| token_embd, output | Q6_K | 6.5 | Shared embeddings |
| attn_q, attn_output | Q5_K | 5.5 | More tolerant |
| ssm_in, ssm_out | Q5_K | 5.5 | SSM projections |
| 256 MoE experts (FFN) | IQ3_S | 3.44 | 80% of params, high MoE redundancy |
- imatrix: Generated on BF16 model (B200, 192 GB VRAM), 200 calibration chunks
- Result: 15 GB with no measured quality loss on agentic benchmarks vs BF16
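A quick arithmetic check ties the ~3.78 BPW figure to the file size. The sketch assumes a round 35e9 parameters (taken from the model name, not from the GGUF header) and ignores metadata overhead. The second helper back-solves the average BPW of the non-expert tensors from the rounded "80% of params" figure, so it is only a rough sanity check.

```python
def gguf_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight payload in GiB for a given average bits-per-weight."""
    return n_params * bits_per_weight / 8 / 2**30


def implied_rest_bpw(overall: float, expert_share: float, expert_bpw: float) -> float:
    """Average BPW of non-expert tensors implied by a weighted mean."""
    return (overall - expert_share * expert_bpw) / (1 - expert_share)


print(round(gguf_size_gib(35e9, 3.78), 1))            # about 15.4 GiB
print(round(implied_rest_bpw(3.78, 0.80, 3.44), 2))   # about 5.14 BPW
```

Because both the 80% share and the 3.78 average are rounded, small changes in either input move the implied non-expert figure noticeably.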
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B (MoE, 256 experts) |
| Method | SFT BF16 LoRA r64, completion-only loss |
| Dataset | 10,191 samples (v1 base + 428 additional: IFEval strict, OPSDC reasoning, instruction following) |
| Epochs | 1 (160 steps, batch 64) |
| Training GPU | NVIDIA B200 |
| Training cost | ~$2 |
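The step count in the table follows from the dataset size and batch size, assuming the final partial batch still counts as one optimizer step:

```python
import math

samples = 10_191   # dataset size from the table
batch_size = 64    # batch size from the table

steps_per_epoch = math.ceil(samples / batch_size)
print(steps_per_epoch)  # 160, matching "1 epoch (160 steps, batch 64)"
```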
v3 dataset additions (on top of v1 base)
- +50 IFEval strict (5 constraint categories)
- +30 strict code (no markdown)
- +30 code gen with thinking
- +30 instruction following
- +20 OPSDC-compressed reasoning (-64% tokens)
- +15 multi-turn agentic
Limitations
- MTP infrastructure present, gated. This GGUF carries an MTP (multi-token prediction) head; chimere-server detects it via `n_nextn_layer = 1` and exposes the speculative-decoding infrastructure (`mtp_scheduler.rs`, `MtpOp` FFI). An early March bench on a previous build measured +49.5% token acceptance rate for the MTP draft path; that figure is not currently reproducible because `bench_mtp.rs:104-167` has Benchmarks 2 and 5 hard-coded as `SKIPPED` with the comment `crash in ik_llama MTP graph, KV cache issue for layer 41`. Until that fix lands the 80 tok/s figure above is the non-MTP path. We will re-publish the MTP gain once the bench passes.
- Engram is a domain-knowledge overlay, not a measured quality boost. The only saved engram eval in the chimere repo (`benchmarks/engram_trained_eval.json`) was run on GPT-2 + wikitext-2 and shows a −13.39% PPL regression on that out-of-distribution setup. No Qwen3.5-specific perplexity eval has been published yet. Engram is shipped as an optional per-domain n-gram bias (kine, code, cyber, general); qualitative use shows specialized vocabulary in responses (`drainage bronchique postural`, `EMII`, ...) on the kiné domain, but there is no quantitative claim attached to it today.
- Multi-slot concurrent decoding via `ik_llama.cpp` is broken under heavy load (`ik_llama` multi-slot bug, slot 0 contamination of system prompts under contention). The `chimere-server` production deployment is single-slot. Stock `llama-server` does NOT have this bug if you need parallel slots.
- Tool-calling sampler defaults: `presence_penalty` defaults to `0.0`; a previous default of `1.5` killed code generation and long reasoning blocks. See chimere-server source.
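The presence-penalty point is easier to see with the usual OpenAI-style definition, in which every token that has already appeared has a flat penalty subtracted from its logit. The sketch below is that generic definition, not the chimere-server sampler, whose exact implementation is not shown in this card.

```python
def apply_presence_penalty(logits, generated_ids, penalty):
    """Subtract a flat penalty from the logit of every token already generated."""
    seen = set(generated_ids)
    return [
        logit - penalty if token_id in seen else logit
        for token_id, logit in enumerate(logits)
    ]


logits = [2.0, 2.0, 2.0]
# Token 1 was already emitted (think of an indent, a brace, or a variable name).
print(apply_presence_penalty(logits, generated_ids=[1, 1], penalty=1.5))
print(apply_presence_penalty(logits, generated_ids=[1, 1], penalty=0.0))
```

Code and long reasoning traces reuse the same tokens heavily, so a penalty of 1.5 keeps pushing the sampler away from tokens it needs to repeat; with 0.0 the logits are left untouched.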
Files
| File | Size | Description |
|---|---|---|
| `chimere-v3-ramp.gguf` | 15 GB | v3 RAMP GGUF (instructions + reasoning focus) |
| `imatrix.dat` | 184 MB | Importance matrix used for quantization |
Related
- chimere -- Official Rust runtime (chimere-server) with Engram, MTP, multi-agent, multi-arch dispatch
- ik_llama.cpp fork -- Backend with Mamba-2 + Nemotron-H backport (PR #1593)
- Chimere v1 GGUF -- Best code + tools
- BF16 full weights -- For re-quantization or fine-tuning
- LoRA adapter -- For further training
- Chimere ODO -- A-LoRA intent routing
Citation
@misc{chimere-v3-2026,
title={Chimere v3: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Instructions and Reasoning},
author={Kevletesteur},
year={2026},
url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF}
}