Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4
NVFP4-quantized version of huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated: a Claude 4.7 Opus-distilled, abliterated (uncensored) Qwen3.6 MoE with 256 experts and 3B active parameters.
67 GB → 21.9 GB. Single NVIDIA Blackwell GPU. 182 tok/s. 256K context. VLM. Uncensored.
Why This Model
Claude 4.7 Opus intelligence distilled into a locally runnable MoE, with abliteration for unrestricted research use:
- 256 experts, 3B active — extreme sparsity = extreme speed
- Claude 4.7 Opus distillation — latest Opus reasoning quality
- 262,144-token (256K) native context: fits on a single 96 GB GPU with an FP8 KV cache
- VLM — vision fully functional (BF16 precision)
- Abliterated — no refusals, full capability for research and local deployment
Key Specs
| Item | Value |
|---|---|
| Base model | huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated |
| Architecture | Qwen3.5 MoE — 35B total, 3B active, 256 experts (8 routed + 1 shared) |
| Quantization | NVFP4 W4A4 (weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor (main) |
| Calibration | 512 samples, ultrachat_200k, seq_len=2048, moe_calibrate_all_experts=True |
| Size | 21.9 GB |
| Max context | 262,144 tokens (native) |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM nightly (cu130) |
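As a rough sanity check on the 21.9 GB size, the storage cost of the NVFP4 weights can be estimated with a few lines of arithmetic. This is a sketch, not the author's calculation: it assumes the usual NVFP4 layout of 4-bit values with one FP8 scale per block of 16 elements (the block size is not stated in this card), and it ignores the layers kept in higher precision (lm_head, vision tower, gates), so it gives a lower bound on the checkpoint size.

```python
def nvfp4_bytes(num_params: int, block_size: int = 16) -> float:
    """Approximate bytes needed to store `num_params` weights in NVFP4.

    Assumes 4 bits per weight plus one 8-bit (FP8) scale per block.
    block_size=16 is an assumption, not a value taken from this card.
    """
    weight_bits = 4 * num_params
    scale_bits = 8 * (num_params / block_size)
    return (weight_bits + scale_bits) / 8


if __name__ == "__main__":
    total_params = 35_000_000_000  # "35B total" from the spec table
    gigabytes = nvfp4_bytes(total_params) / 1e9
    print(f"{gigabytes:.1f} GB for the quantized weights alone")
```

Under these assumptions the cost works out to 4.5 bits per weight; the gap to the published size would then come from the layers excluded from quantization.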
Quickstart
vLLM
vllm serve sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4 \
--max-model-len 32768 \
--reasoning-parser qwen3 \
--kv-cache-dtype fp8
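Once the server is up, it can be queried through its OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the standard library; it assumes vLLM's default port 8000 (the command above does not set one), and the helper names build_chat_payload and chat are illustrative.

```python
import json
import urllib.request

MODEL = "sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4"
BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default port


def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send one prompt to the running server and return the answer text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # With a reasoning parser enabled, the reasoning text may be returned
    # separately, and "content" can be empty if max_tokens is too small.
    return body["choices"][0]["message"].get("content") or ""


# Usage (requires the server started above):
#   print(chat("Explain the CAP theorem in two sentences."))
```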
With tool calling (agentic)
vllm serve sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4 \
--max-model-len 32768 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--kv-cache-dtype fp8
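With tool calling enabled, a request can carry tool definitions in the OpenAI function-calling format, and the response may contain structured tool calls. The sketch below only builds the request body and parses a response; the get_weather tool and both helper names are hypothetical and exist purely for illustration.

```python
import json

MODEL = "sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4"


def build_tool_call_payload(prompt: str) -> dict:
    """Build a chat request that offers one (hypothetical) tool to the model."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, not part of the model
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }


def extract_tool_calls(response_body: dict) -> list[tuple[str, dict]]:
    """Return (tool name, parsed arguments) pairs from a chat response."""
    message = response_body["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # In the OpenAI format, "arguments" is a JSON-encoded string.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

The payload is sent to the same /v1/chat/completions endpoint as a plain chat request; the caller is responsible for executing the tool and returning its result in a follow-up message.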
Docker
docker run --gpus device=0 -p 8090:8090 \
-v /path/to/model:/models/current:ro \
--shm-size 16gb \
vllm/vllm-openai:cu130-nightly \
vllm serve /models/current --port 8090 --max-model-len 32768 \
--reasoning-parser qwen3 --kv-cache-dtype fp8
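Loading the weights inside the container can take a while, so a client may want to wait until the server answers before sending requests. A minimal polling sketch follows; it assumes the OpenAI-compatible /v1/models route on port 8090 from the command above, and the function names and timeout values are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str = "http://localhost:8090/v1/models",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it responds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        sleep(interval_s)
    return False


# Usage (requires the container started above):
#   if not wait_until_ready():
#       raise SystemExit("server did not come up in time")
```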
Benchmark
Single NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), 256K context, FP8 KV cache.
| Test | Tokens | Speed | Result |
|---|---|---|---|
| English (CAP theorem) | 256 | 182 tok/s | PASS |
| Japanese (量子コンピュータ) | 256 | 182 tok/s | PASS |
| Code (async scheduler) | 512 | 182 tok/s | PASS |
| Math (Bayes theorem) | 512 | 182 tok/s | PASS |
| Burst stability (×3) | 512 | 182-188 tok/s | PASS — stable |
| VLM (shape recognition) | 256 | — | PASS ✅ |
Sustained: ~182 tok/s (single GPU, 256K context).
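The card does not say how throughput was measured. One plausible way to obtain comparable numbers is to divide the number of generated tokens by the wall-clock time of a request, as in this sketch. The generate callable is hypothetical: it stands for any function that runs one request and returns the completion token count (for example the usage.completion_tokens field of an OpenAI-style response).

```python
import time
from typing import Callable


def measure_tokens_per_second(
    generate: Callable[[], int],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run `generate` once and return generated tokens per wall-clock second.

    Note that this includes prompt processing time, so it slightly
    understates pure decode speed for long prompts.
    """
    start = clock()
    completion_tokens = generate()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return completion_tokens / elapsed
```

Averaging several runs, as in the burst stability row, smooths out warm-up effects.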
VRAM Usage
| Context Length | VRAM | KV Cache |
|---|---|---|
| 262,144 (256K) | 95.6 GB | FP8 |
256K context fits on a single 96 GB Blackwell GPU with FP8 KV cache.
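To see why the KV cache data type matters at this context length, the standard size formula for attention KV caches can be evaluated directly. This is a generic sketch: the layer count, KV head count, and head dimension below are placeholders chosen for illustration, not values taken from this model's configuration. In a hybrid architecture only the full-attention layers hold a cache that grows with context.

```python
def kv_cache_bytes(
    tokens: int,
    attention_layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_element: int,
) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    per_token = 2 * attention_layers * kv_heads * head_dim * bytes_per_element
    return per_token * tokens


if __name__ == "__main__":
    # Placeholder architecture values (assumptions, not from the card).
    shape = dict(attention_layers=10, kv_heads=2, head_dim=256)
    fp8 = kv_cache_bytes(262_144, bytes_per_element=1, **shape)
    bf16 = kv_cache_bytes(262_144, bytes_per_element=2, **shape)
    print(f"FP8:  {fp8 / 2**30:.2f} GiB per sequence")
    print(f"BF16: {bf16 / 2**30:.2f} GiB per sequence")
```

Whatever the real dimensions are, halving the bytes per element halves the cache, which is where the 2x capacity of an FP8 KV cache comes from.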
Also Available
| Model | Speed | Link |
|---|---|---|
| Qwen3.6-35B-A3B (base) | 182 tok/s | sakamakismile/Qwen3.6-35B-A3B-NVFP4 |
| Huihui abliterated | 175 tok/s | sakamakismile/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4 |
| Claude 4.6 Opus abliterated | 175 tok/s | sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.6-Opus-abliterated-NVFP4 |
Quantization Details
Recipe
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)
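The ignore list mixes a plain module name with regular expressions (entries prefixed with re:). The sketch below approximates how such a list selects modules, to show which layers stay unquantized. It is a simplified stand-in for the library's matching logic, and the module names in the usage example are hypothetical.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]


def is_ignored(module_name: str, ignore: list[str] = IGNORE) -> bool:
    """Return True if `module_name` matches any entry of the ignore list.

    Simplified approximation: "re:" entries are treated as regular
    expressions anchored at the start of the name, and all other
    entries must equal the name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    for name in [
        "lm_head",
        "model.visual.blocks.0.attn.qkv",
        "model.language_model.layers.0.mlp.gate",
        "model.language_model.layers.0.mlp.experts.3.gate_proj",
    ]:
        print(name, "->", "kept in high precision" if is_ignored(name) else "quantized")
```

The trailing $ is what keeps the expert projections (names ending in gate_proj) in the quantized set while the small gate layers are skipped.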
Calibration
- Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
- Samples: 512
- Max sequence length: 2048
- moe_calibrate_all_experts=True ensures all 256 experts receive calibration data
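A quick calculation illustrates why calibrating all experts is useful. With ordinary routing each token reaches only 8 of the 256 routed experts, so the sketch below computes the average number of calibration tokens an expert would see. It assumes perfectly uniform routing and full-length samples, both of which are optimistic; with skewed routing some experts would see far fewer tokens.

```python
def mean_tokens_per_expert(
    num_samples: int, seq_len: int, top_k: int, num_experts: int
) -> float:
    """Average tokens routed to each expert under uniform routing (an assumption)."""
    total_tokens = num_samples * seq_len
    return total_tokens * top_k / num_experts


if __name__ == "__main__":
    mean = mean_tokens_per_expert(num_samples=512, seq_len=2048, top_k=8, num_experts=256)
    print(f"each expert sees about {mean:.0f} tokens, or {8 / 256:.1%} of the data")
```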
Reproduction
from transformers import Qwen3_5MoeForConditionalGeneration, AutoProcessor, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated"

# Load the unquantized model together with its processor and tokenizer.
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Quantize every Linear layer to NVFP4, skipping the output head,
# the vision tower, and the MoE router and shared-expert gates.
recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

# Calibration data: 512 chat samples from ultrachat_200k.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.shuffle(seed=42)

def preprocess(example):
    # Render each conversation with the model's chat template.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    # Truncate to the 2048-token calibration length.
    return tokenizer(sample["text"], padding=False, max_length=2048,
                     truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# One-shot calibration and quantization, feeding data through all experts.
oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=2048, num_calibration_samples=512,
        moe_calibrate_all_experts=True)

# Save the compressed checkpoint with its processor and tokenizer.
model.save_pretrained("output", save_compressed=True)
processor.save_pretrained("output")
tokenizer.save_pretrained("output")
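After the script finishes, it can be worth confirming that the saved checkpoint actually carries quantization metadata before serving it. The sketch below reads the quantization_config section of the saved config.json. The helper name is illustrative, and the exact keys inside that section depend on the compressed-tensors version, so the code only reports what it finds.

```python
import json
from pathlib import Path


def read_quantization_config(output_dir: str) -> dict:
    """Return the quantization_config section of a saved checkpoint.

    Returns an empty dict if the checkpoint has no such section,
    which would suggest the weights were saved unquantized.
    """
    config_path = Path(output_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("quantization_config", {})


# Usage (after running the reproduction script):
#   quant = read_quantization_config("output")
#   print(sorted(quant))  # inspect which settings were recorded
```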
Environment
| Package | Version |
|---|---|
| torch | 2.11.0+cu130 |
| transformers | 5.5.4 |
| llmcompressor | 0.1.dev (main) |
| compressed-tensors | 0.15.1a20260414 |
| CUDA | 13.0 |
Requirements
- GPU: NVIDIA Blackwell (SM 120) — RTX 5090, 5080, 5070 Ti, RTX PRO 6000
- VRAM: ~22 GB minimum (text only), 96 GB for 256K context
- Software: vLLM nightly (cu130)
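The GPU requirement can be checked before downloading the weights. The sketch below compares the local compute capability against SM 120, taken here to mean compute capability (12, 0). It uses torch.cuda.get_device_capability when PyTorch is installed; the helper names are illustrative.

```python
from typing import Optional


def is_sm120(capability: tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is SM 120."""
    return tuple(capability) == (12, 0)


def local_gpu_capability() -> Optional[tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if unavailable."""
    try:
        import torch  # optional dependency, imported lazily
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    capability = local_gpu_capability()
    if capability is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    elif is_sm120(capability):
        print("SM 120 GPU found: meets the stated requirement.")
    else:
        print(f"Compute capability {capability} does not match SM 120.")
```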
Notes
- Abliterated (uncensored). Use responsibly.
- Multimodal (vision) fully functional at BF16 precision.
- Gated DeltaNet + Attention hybrid architecture.
- NVFP4 is Blackwell-specific. Will not work on Ampere/Hopper.
- Use --kv-cache-dtype fp8 for 2x KV capacity at no quality cost.
Credits
- Abliteration + Claude distillation: huihui-ai
- Original model: Qwen
- Quantization tool: llm-compressor
- SM120 NVFP4 kernels: blackwell-geforce-nvfp4-gemm
- Quantized by: Lna-Lab
Support the Base Model Author
- Ko-fi: https://ko-fi.com/huihuiai
- Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
Model tree for sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4
Base model
Qwen/Qwen3.6-35B-A3B