Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-int4-AutoRound
English
INT4 AutoRound quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated — a Claude-4.7-Opus distilled, abliterated MoE — optimized for NVIDIA DGX Spark (GB10 SM121) with Marlin INT4 kernel acceleration.
Model Details
| Item | Value |
|---|---|
| Architecture | MoE (35B total, 3B active; 256 experts, 8 routed + 1 shared per token) + GDN (Gated DeltaNet linear attention) + Attention hybrid |
| Base model | huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated |
| Upstream model | Qwen/Qwen3.6-35B-A3B |
| Distillation | Reasoning / chain-of-thought distilled from Claude 4.7 Opus by huihui-ai |
| Abliteration | Safety filtering removed by huihui-ai (no TransformerLens) |
| Quantized by | YuYu1015 |
| Model size | ~23 GB (vs ~71.9 GB BF16 original) |
| Context length | Up to 262,144 tokens (in practice limited by KV cache capacity within 128 GB of unified memory) |
| Thinking mode | Supported (enable_thinking: true/false) |
| Tool calling | Supported (qwen3_xml parser) |
| MTP | Built-in MTP weights included |
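The thinking-mode row can be exercised per request. The following is a minimal sketch, assuming the model is served behind an OpenAI-compatible vLLM endpoint and that the chat template reads `enable_thinking` from `chat_template_kwargs`. The URL, port, and helper names are illustrative assumptions; the served model name is taken from the serve command in this card.

```python
import json
import urllib.request

# Assumed endpoint: vLLM's default port on the local machine.
BASE_URL = "http://localhost:8000/v1/chat/completions"
# Served name taken from the `--served-model-name` flag used in this card.
MODEL_NAME = "qwen3.6-35b-a3b-claude47opus"


def build_chat_payload(prompt, enable_thinking=True, max_tokens=512):
    """Build an OpenAI-style chat request that toggles thinking mode."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Assumption: the server forwards these kwargs to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send(payload, url=BASE_URL):
    """POST the payload and return the decoded JSON reply (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the request body; call send(payload) once a server is up.
    payload = build_chat_payload("What is the capital of France?", enable_thinking=False)
    print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the network call makes it easy to inspect the request before a server is running.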
About the Base Model
This is a quantization of huihui-ai's Claude-4.7-Opus distilled variant of Qwen3.6-35B-A3B, with abliteration applied on top:
- Distillation source: Claude 4.7 Opus (Anthropic)
- Distillation target: reasoning quality and chain-of-thought patterns
- Abliteration: orthogonalization of safety-refusal directions in residual stream
The result is an MoE model that retains Claude-4.7-Opus's reasoning style while removing default safety filters. Quantization preserves these traits — INT4 AutoRound (W4A16) recovers ~99.5% of the BF16 baseline.
Quantization Details
| Item | Value |
|---|---|
| Method | Intel AutoRound v0.12.2 |
| Bits | 4 |
| Group size | 128 |
| Symmetric | Yes |
| Format | auto_round (GPTQ-compatible) |
| Iterations | 200 |
| Calibration dataset | NeelNanda/pile-10k (auto-round default) |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| Torch compile | Enabled (--enable_torch_compile) |
| Hardware | NVIDIA DGX Spark (GB10, 128GB unified memory) |
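The bits, group size, and symmetric settings in the table describe the storage grid that weights are mapped onto. The sketch below illustrates that grid with plain round-to-nearest quantization, which is not AutoRound's tuned rounding; it only shows what "4-bit, group size 128, symmetric" means. The function names are hypothetical, and the bits-per-weight estimate assumes one 16-bit scale per group and ignores any packed zero-points or metadata.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1   # largest code, 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # smallest code, -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups and quantize each with its own scale."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def bits_per_weight(bits=4, group_size=128, scale_bits=16):
    """Rough storage cost: the code itself plus the group's scale shared by its weights."""
    return bits + scale_bits / group_size


if __name__ == "__main__":
    row = [((-1) ** i) * i / 255 for i in range(256)]  # synthetic weights
    groups = quantize_row(row)
    print(len(groups), "groups,", bits_per_weight(), "bits per weight (estimate)")
```

With these assumptions each quantized weight costs about 4.125 bits, which is why the layers kept in BF16 account for a noticeable share of the final file size.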
Layers Preserved in BF16
The following layers are not quantized to preserve model quality:
| Layer | Reason |
|---|---|
| `lm_head` | Output head, sensitive to quantization noise (auto-excluded by shape) |
| `embed_tokens` | Input embeddings (auto-excluded by shape) |
| `mlp.shared_expert.*` | Shared expert weights, processes every token |
| `mlp.shared_expert_gate` | Shared expert routing gate |
| `mlp.gate` | MoE routing gate (auto-excluded by quantization scheme) |
| `linear_attn.*` | GDN/DeltaNet layers, may output zeros if quantized |
| `mtp.fc` | Multi-Token Prediction projection (preserved as BF16) |
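To check which modules of a checkpoint fall under these rules, the patterns can be matched against dotted module names. This is a minimal sketch and not how AutoRound itself selects layers; the matching rule (a pattern must match a contiguous run of name components) and the example module names are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the table of layers kept in BF16.
BF16_PATTERNS = [
    "lm_head",
    "embed_tokens",
    "mlp.shared_expert.*",
    "mlp.shared_expert_gate",
    "mlp.gate",
    "linear_attn.*",
    "mtp.fc",
]


def keeps_bf16(module_name, patterns=BF16_PATTERNS):
    """Return True if any pattern matches a contiguous run of name components."""
    parts = module_name.split(".")
    for pattern in patterns:
        wanted = pattern.split(".")
        for start in range(len(parts) - len(wanted) + 1):
            window = parts[start:start + len(wanted)]
            if all(fnmatchcase(part, want) for part, want in zip(window, wanted)):
                return True
    return False


if __name__ == "__main__":
    # Hypothetical module names in the style of a Hugging Face state dict.
    for name in [
        "model.layers.3.mlp.shared_expert.up_proj",
        "model.layers.3.mlp.experts.17.down_proj",
        "model.layers.3.mlp.gate",
        "model.layers.0.linear_attn.in_proj",
    ]:
        print(name, "->", "BF16" if keeps_bf16(name) else "INT4")
```

Matching whole components rather than substrings keeps `mlp.gate` from accidentally matching an expert's `gate_proj`.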
Performance
Tested on a single NVIDIA DGX Spark (GB10, 128GB LPDDR5X, SM121):
| Configuration | Decode Speed | Notes |
|---|---|---|
| INT4 + DFlash-15 (daily conversation) | 40-60 tok/s | With Qwen3.6-35B-A3B-DFlash drafter |
The DFlash drafter is the same one used for the base Qwen3.6-35B-A3B and was trained against that model, so the acceptance rate on this distilled and abliterated variant may be slightly lower than on the original. Verify with the `spec_decode_num_accepted_tokens_total` metric and reduce `num_speculative_tokens` if the acceptance rate falls below 50%.
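The acceptance-rate check can be scripted against the server's Prometheus metrics text. The sketch below is hedged: the `vllm:` prefix and the draft-token counter name are assumptions about the metrics output, so confirm both against your own `/metrics` endpoint. The sample text uses made-up numbers purely to demonstrate the arithmetic.

```python
# Counter names as assumed here; confirm them against your server's /metrics output.
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"


def sum_metric(metrics_text, name):
    """Sum every sample of a counter across label sets in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks plus HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            total += float(value)
    return total


def acceptance_rate(metrics_text, accepted=ACCEPTED, drafted=DRAFTED):
    """Accepted draft tokens divided by proposed draft tokens, or None if no drafts yet."""
    drafted_total = sum_metric(metrics_text, drafted)
    if drafted_total == 0:
        return None
    return sum_metric(metrics_text, accepted) / drafted_total


# Made-up sample in Prometheus text format, for illustration only.
SAMPLE = """\
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 600.0
# TYPE vllm:spec_decode_num_draft_tokens_total counter
vllm:spec_decode_num_draft_tokens_total{model_name="demo"} 1000.0
"""

if __name__ == "__main__":
    rate = acceptance_rate(SAMPLE)
    print(f"acceptance rate: {rate:.0%}")
    if rate is not None and rate < 0.5:
        print("consider lowering num_speculative_tokens")
```

Because the counters are cumulative, comparing two scrapes taken a few minutes apart gives a better picture of the current workload than a single reading.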
Speculative Decoding
This model supports two speculative decoding methods:
DFlash (requires separate drafter model):
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'
MTP (uses built-in weights, no extra model needed):
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
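Hand-written JSON inside shell quotes is easy to get wrong, so the two flags above can be generated instead. This is a small sketch using only the standard library; the helper name is hypothetical, and the keys are exactly those shown in the flags above.

```python
import json
import shlex


def speculative_config_flag(method, num_speculative_tokens, drafter=None):
    """Render a shell-safe --speculative-config flag from its parts."""
    config = {"method": method}
    if drafter is not None:
        config["model"] = drafter  # only needed when a separate drafter is used
    config["num_speculative_tokens"] = num_speculative_tokens
    return "--speculative-config " + shlex.quote(json.dumps(config))


if __name__ == "__main__":
    print(speculative_config_flag("dflash", 15, drafter="z-lab/Qwen3.6-35B-A3B-DFlash"))
    print(speculative_config_flag("mtp", 1))
```

Lowering `num_speculative_tokens` then becomes a one-argument change rather than an edit inside a quoted JSON string.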
Serving with vLLM
vllm serve /path/to/model \
--quantization moe_wna16 \
--served-model-name qwen3.6-35b-a3b-claude47opus \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--kv-cache-dtype auto \
--gpu-memory-utilization 0.80 \
--max-model-len 65536 \
--enable-prefix-caching \
--enable-chunked-prefill \
--trust-remote-code \
--language-model-only
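With `--enable-auto-tool-choice` and `--tool-call-parser qwen3_xml` set, tool calls should come back in the OpenAI-style `tool_calls` field. The sketch below builds such a request and parses a reply. The `get_weather` tool is hypothetical, the reply in the demo is hand-written rather than real model output, and the request still has to be POSTed to the server's `/v1/chat/completions` route.

```python
import json

# Served name from the `--served-model-name` flag in the command above.
MODEL_NAME = "qwen3.6-35b-a3b-claude47opus"


def build_tool_request(prompt, model=MODEL_NAME):
    """Build a chat request that offers one hypothetical tool to the model."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-style chat completion reply."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


if __name__ == "__main__":
    print(json.dumps(build_tool_request("What is the weather in Paris?"), indent=2))
    # Hand-written reply in the OpenAI response shape, not real model output.
    fake_reply = {"choices": [{"message": {"tool_calls": [
        {"function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}
    ]}}]}
    print(extract_tool_calls(fake_reply))
```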
DGX Spark (SM121) Compatibility Notes
- Use `--quantization moe_wna16` for the Marlin INT4 kernel (SM121 compatible via SM120 binary compatibility).
- FP8 KV cache is not compatible with the GDN non-causal attention layers; use `--kv-cache-dtype auto`.
- NVFP4 falls back to Marlin W4A16 on SM121 (missing `cvt.e2m1x2` PTX instruction).
- Runtime FP8 (`--quantization fp8`) is not compatible with DFlash (the drafter inherits the FP8 config and crashes).
- `--language-model-only` skips vision encoder profiling for text-only inference.
- `--performance-mode throughput` enables CUDA graphs and kernels for throughput optimization.
- Clear the page cache before starting on UMA: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
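Before launching, it can help to see how much unified memory the page cache is holding. The sketch below parses `/proc/meminfo` and compares free memory with the `--gpu-memory-utilization 0.80` budget. Treating `MemFree` as what the GPU allocator can claim is a rough heuristic and an assumption here, not documented vLLM behavior; the function names are hypothetical.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kiB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_report(text, gpu_memory_utilization=0.80):
    """Summarize free and cached memory against the requested utilization budget."""
    info = parse_meminfo(text)
    budget_kib = info["MemTotal"] * gpu_memory_utilization
    kib_per_gib = 2 ** 20
    return {
        "total_gib": info["MemTotal"] / kib_per_gib,
        "free_gib": info["MemFree"] / kib_per_gib,
        "cached_gib": info.get("Cached", 0) / kib_per_gib,
        "budget_gib": budget_kib / kib_per_gib,
        # Heuristic assumption: page cache is not counted as free memory.
        "free_covers_budget": info["MemFree"] >= budget_kib,
    }


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):  # Linux only
        with open("/proc/meminfo") as handle:
            for key, value in memory_report(handle.read()).items():
                print(f"{key}: {value}")
```

If `free_covers_budget` is false while `cached_gib` is large, dropping the page cache as described above is the likely fix.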
Safety Warning
This model has safety filtering removed (abliterated) and is distilled from a frontier model (Claude 4.7 Opus). It may generate sensitive, controversial, or inappropriate content with high fluency and reasoning depth. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.
Credits
- Upstream Model: Qwen/Qwen3.6-35B-A3B by Alibaba Qwen Team
- Distillation Source: Claude 4.7 Opus by Anthropic
- Distillation + Abliteration: huihui-ai
- INT4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
- Quantization Tool: Intel AutoRound