Qwen3.6-35B-A3B-TQ3_4S
GGUF quantization of Qwen/Qwen3.6-35B-A3B using TQ3_4S with mixed-precision MoE compression — 2-bit experts, 4-bit attention.
Files
| File | Description |
|---|---|
| Qwen3.6-35B-A3B-TQ3_4S.gguf | Main model (12.4 GiB, 3.07 BPW) |
| mmproj-BF16.gguf | Multimodal projector (BF16) |
Quantization
MoE experts tolerate aggressive compression because only 8 of the 256 experts are active per token. This quantization exploits that asymmetry by assigning precision per component:
| Component | Quant | Rationale |
|---|---|---|
| Expert MLP gate/up | Q2_K | 98% of params, MoE-tolerant |
| Expert MLP down | Q3_K | Write-back sensitivity |
| Attention Q/K/V/O | TQ3_4S | WHT-protected |
| Embeddings + output | Q6_K | Quality anchor |
Runtime Requirement
This model requires the public TurboQuant runtime fork:
Recommended Settings (16GB VRAM)
./build/bin/llama-server \
-m Qwen3.6-35B-A3B-TQ3_4S.gguf \
-ngl 99 -c 4096 -np 1 \
-ctk q4_0 -ctv tq3_0 -fa on \
--jinja \
--reasoning off --reasoning-budget 0 --reasoning-format deepseek
With vision:
./build/bin/llama-server \
-m Qwen3.6-35B-A3B-TQ3_4S.gguf \
--mmproj mmproj-BF16.gguf \
-ngl 99 -c 4096 -np 1 \
-ctk q4_0 -ctv tq3_0 -fa on \
--jinja --no-mmproj-offload \
--reasoning off --reasoning-budget 0 --reasoning-format deepseek
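Once either server command above is running, it can be queried over llama-server's OpenAI-compatible HTTP API. The following is a minimal sketch and not part of the original release. It assumes the server listens on llama-server's default address, http://localhost:8080, since the commands above pass no --port. The helper names build_chat_payload and chat are hypothetical, and the image part is only meaningful when the server was started with --mmproj.

```python
import json
import urllib.request

# Assumption: llama-server's default host and port; adjust if you pass --host/--port.
BASE_URL = "http://localhost:8080/v1"


def build_chat_payload(prompt, image_url=None, max_tokens=256):
    """Build an OpenAI-style chat completion request body."""
    if image_url is None:
        # Text-only request: content is a plain string.
        content = prompt
    else:
        # Vision request: content is a list of typed parts.
        content = [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    return {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def chat(prompt, image_url=None, base_url=BASE_URL):
    """POST the request to a running server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, image_url)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With a server running, chat("What is the capital of France?") returns the reply text. Passing image_url sends the same prompt with an image part attached, unchanged.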
Performance (RTX 5060 Ti 16GB)
| Metric | Value |
|---|---|
| PP512 | 1832 tok/s |
| TG128 | 107 tok/s |
| Size | 12.4 GiB |
| BPW | 3.07 |
| ngl | 99 (full GPU) |
Fits entirely in 16GB VRAM — no CPU offload needed.
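The size and BPW figures can be cross-checked against each other. The sketch below assumes that 12.4 GiB means 12.4 × 2^30 bytes and that BPW is the total file size in bits divided by the weight count, so any metadata is lumped into the average. The function name implied_param_count is illustrative.

```python
GIB = 2**30  # bytes per GiB


def implied_param_count(size_gib, bits_per_weight):
    """Number of weights implied by a file size and an average bits-per-weight."""
    total_bits = size_gib * GIB * 8
    return total_bits / bits_per_weight


params = implied_param_count(12.4, 3.07)
print(f"{params / 1e9:.1f}B weights")  # prints "34.7B weights"
```

The result, roughly 34.7B weights, is consistent with the 35B in the model name.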
Quality
10/10 correct on an informal 10-question sanity check: capital of France, 2+2, reversing a string in Python, gravity, WW2, primes, boiling point, Shakespeare, Jupiter, and translating "hello" to "Hola".
Base Model
- Qwen/Qwen3.6-35B-A3B
- Source: unsloth/Qwen3.6-35B-A3B-GGUF (Q8_0)
License
Apache 2.0 — same as the base model.
Tool Call Validation
Tested with --jinja in two configurations, --reasoning off and --reasoning on with --reasoning-budget 2048:
| Test | reasoning off | reasoning on |
|---|---|---|
| Basic tool call trigger | ✅ | ✅ |
| Tool response → final answer (no loop) | ✅ | ✅ |
| Correct tool selection from multiple | ✅ | ✅ |
| No tool call for simple questions | ✅ | ✅ |
| Multi-step tool use | ✅ | ✅ |
| Nested quote escaping retry (no loop) | ✅ | ✅ |
| Total | 10/10 | 10/10 |
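For readers who want to probe the "basic tool call trigger" and "no tool call for simple questions" rows by hand, here is a rough sketch of how such a check could look. It is not the contents of test_tool_calls.sh. The get_weather tool and the helper names are hypothetical, and the response shape is assumed to follow the OpenAI chat-completions format.

```python
import json

# Hypothetical tool definition in OpenAI function-calling format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt, tools):
    """Request body that offers the model a list of tools."""
    return {"messages": [{"role": "user", "content": prompt}], "tools": tools}


def extract_tool_calls(response):
    """Return [(name, arguments_dict), ...] from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

A weather question passes the trigger check when extract_tool_calls returns a non-empty list. A simple question such as "What is 2+2?" passes the no-tool check when the list is empty.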
Recommended settings for tool-use / agentic workflows
--jinja --reasoning off --reasoning-budget 0 --reasoning-format deepseek
Avoid --presence-penalty above 0.5 for tool use: high values diversify reasoning tokens but don't improve structured JSON output, and they can cause repeated near-identical tool calls in agent loops.
If using --reasoning on, ensure your agent framework detects consecutive identical tool calls and breaks after 2-3 retries.
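One way to implement that guard is to fingerprint each tool call and count consecutive repeats. The sketch below is framework-agnostic and hypothetical: the class name ToolLoopGuard and the default limit of 3 are illustrative choices within the 2-3 range suggested above.

```python
import json


class ToolLoopGuard:
    """Flags when the same tool call repeats too many times in a row."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self._last = None
        self._count = 0

    def should_break(self, name, arguments):
        """Record a call; return True once it has occurred max_repeats times in a row."""
        # Canonical JSON so that key order does not hide identical calls.
        fingerprint = (name, json.dumps(arguments, sort_keys=True))
        if fingerprint == self._last:
            self._count += 1
        else:
            self._last = fingerprint
            self._count = 1
        return self._count >= self.max_repeats
```

Call should_break(name, arguments) before executing each tool call in the agent loop. With the default limit it returns True on the third consecutive identical call, at which point the loop can stop and return an error or the last tool result instead of retrying.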
Run tests yourself
chmod +x test_tool_calls.sh
./test_tool_calls.sh 8085