Libraries

How to use sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4")
model = AutoModelForImageTextToText.from_pretrained("sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
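
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on localhost:8000. The helper names `build_image_chat_payload` and `post_chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL = "sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4"


def build_image_chat_payload(prompt, image_url, model=MODEL):
    """Build the same request body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the OpenAI-compatible endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires a running server, so it is left commented out):
# payload = build_image_chat_payload("Describe this image in one sentence.", "<image URL>")
# print(post_chat(payload)["choices"][0]["message"]["content"])
```

Keeping payload construction separate from the network call makes the request body easy to inspect or log before anything is sent.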

Use Docker Model Runner

docker model run hf.co/sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4

SGLang

How to use sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4" \
        --host 0.0.0.0 \
        --port 30000


Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4

NVFP4-quantized version of huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated: a Claude 4.7 Opus-distilled, abliterated (uncensored) Qwen3.6 MoE with 256 experts and 3B active parameters.

67 GB → 21.9 GB. Single NVIDIA Blackwell GPU. 182 tok/s. 256K context. VLM. Uncensored.

Why This Model

Claude 4.7 Opus intelligence distilled into a locally runnable MoE, with abliteration for unrestricted research use:

256 experts, 3B active: only about 3B of the 35B parameters run per token, which is where the speed comes from
Claude 4.7 Opus distillation: reasoning behavior distilled from Claude 4.7 Opus
256K native context (262,144 tokens): fits on a single 96 GB GPU with an FP8 KV cache
VLM: vision is fully functional; the vision modules are left unquantized at BF16 precision
Abliterated: no refusals, intended for research and local deployment

Key Specs


| | |
|---|---|
| Base model | huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated |
| Architecture | Qwen3.5 MoE: 35B total, 3B active, 256 experts (8 routed + 1 shared) |
| Quantization | NVFP4 W4A4 (weights FP4, activations FP4, scales FP8) |
| Format | `compressed-tensors` (native vLLM support) |
| Tool | vllm-project/llm-compressor (main) |
| Calibration | 512 samples, ultrachat_200k, seq_len=2048, moe_calibrate_all_experts=True |
| Size | 21.9 GB |
| Max context | 262,144 tokens (native) |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM nightly (cu130) |
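
To make the "weights FP4, scales FP8" entry concrete, here is a simplified pure-Python sketch of NVFP4-style block quantization. It assumes the usual NVFP4 layout (FP4 E2M1 elements, one scale per block of 16 values) and, as a simplification, keeps the block scale in full precision instead of rounding it to FP8 and omits the per-tensor global scale. It illustrates the number format only and is not the kernel or the llm-compressor code path.

```python
# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive values


def fake_quantize_block(block):
    """Quantize then dequantize one block; returns (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest FP4 value (6.0).
    # Simplification: real NVFP4 stores this scale rounded to FP8.
    scale = amax / E2M1_GRID[-1]
    dequantized = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        dequantized.append((nearest if x >= 0 else -nearest) * scale)
    return scale, dequantized


def fake_quantize(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        _, dequantized = fake_quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantized)
    return out


def bits_per_weight(element_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Storage cost per quantized weight, including its share of the block scale."""
    return element_bits + scale_bits / block_size
```

Under these assumptions each quantized weight costs 4.5 bits. Taking 35B as the nominal parameter count, 21.9 GB works out to roughly 5 bits per parameter; the difference presumably reflects the modules left in higher precision.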

Quickstart

vLLM

vllm serve sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4 \
    --max-model-len 32768 \
    --reasoning-parser qwen3 \
    --kv-cache-dtype fp8
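
With `--reasoning-parser` enabled, the server is expected to return the model's thinking separately from the final answer. The sketch below shows one way a client might split the two. The field names `reasoning_content` and `reasoning` are assumptions that should be checked against the installed vLLM version, and the `</think>` fallback assumes Qwen3-style think tags when no parser is active. `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(message):
    """Return (reasoning, answer) from one chat-completion message dict.

    Assumed field names: "reasoning_content" or "reasoning". If neither is
    present, fall back to splitting raw text on a closing </think> tag.
    """
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    if reasoning is None and "</think>" in content:
        raw_reasoning, _, content = content.partition("</think>")
        # The opening tag may be absent if the chat template already emitted it.
        reasoning = raw_reasoning.replace("<think>", "", 1)
    return (reasoning or "").strip(), content.strip()
```

The function takes the `message` object found at `choices[0].message` in a chat completion response.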

With tool calling (agentic)

vllm serve sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4 \
    --max-model-len 32768 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype fp8
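
On the client side, tool calling uses the standard OpenAI chat-completions `tools` field. The sketch below builds such a request and reads tool calls back out of a response. The `get_weather` tool and the helper names are hypothetical, chosen only for illustration.

```python
import json


def make_tool(name, description, parameters):
    """Wrap a JSON-schema parameter spec in the OpenAI function-tool envelope."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": parameters,
        },
    }


def build_tool_request(model, user_text, tools):
    """Build a chat-completions body that lets the model decide whether to call a tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
        "tool_choice": "auto",
    }


def extract_tool_calls(response):
    """Return [(function name, parsed arguments)] from a chat-completions response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string and need decoding.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hypothetical tool definition used only for illustration.
WEATHER_TOOL = make_tool(
    "get_weather",
    "Look up the current weather for a city.",
    {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)
```

The request body can be posted to `/v1/chat/completions` on the server started above; an empty list from `extract_tool_calls` means the model answered directly.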

Docker

docker run --gpus device=0 -p 8090:8090 \
    -v /path/to/model:/models/current:ro \
    --shm-size 16gb \
    vllm/vllm-openai:cu130-nightly \
    vllm serve /models/current --port 8090 --max-model-len 32768 \
    --reasoning-parser qwen3 --kv-cache-dtype fp8

Benchmark

Single NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), 256K context, FP8 KV cache.

| Test | Tokens | Speed | Result |
|---|---|---|---|
| English (CAP theorem) | 256 | 182 tok/s | PASS |
| Japanese (量子コンピュータ) | 256 | 182 tok/s | PASS |
| Code (async scheduler) | 512 | 182 tok/s | PASS |
| Math (Bayes theorem) | 512 | 182 tok/s | PASS |
| Burst stability (×3) | 512 | 182-188 tok/s | PASS (stable) |
| VLM (shape recognition) | 256 | n/a | PASS ✅ |

Sustained: ~182 tok/s (single GPU, 256K context).
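
The card does not say how these figures were measured. One simple way to obtain a comparable number is to divide generated tokens by wall-clock time, as in the sketch below. The helper names are hypothetical, and timing a whole request includes prompt processing, so the result may come out slightly lower than pure decode speed.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Throughput in generated tokens per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def time_generation(generate, clock=time.perf_counter):
    """Time one call to generate(), which must return the number of generated tokens.

    Returns (tokens, elapsed seconds, tokens per second).
    """
    start = clock()
    tokens = generate()
    elapsed = clock() - start
    return tokens, elapsed, tokens_per_second(tokens, elapsed)
```

In practice `generate` would send one request to the server and return `usage["completion_tokens"]` from the response.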

VRAM Usage

| Context Length | VRAM | KV Cache |
|---|---|---|
| 262,144 (256K) | 95.6 GB | FP8 |

256K context fits on a single 96 GB Blackwell GPU with FP8 KV cache.
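
For intuition on why the KV cache dtype matters at this context length, the sketch below applies the standard KV cache size formula for attention layers. The layer count, KV head count, and head dimension are illustrative placeholders, not values taken from this card; the real ones should be read from the model's config.json. The sketch also ignores the fixed-size state of the Gated DeltaNet layers and any memory the server preallocates.

```python
def kv_cache_bytes(tokens, attention_layers, kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache keys and values for `tokens` positions.

    The factor 2 covers one key tensor and one value tensor per attention layer.
    """
    return 2 * attention_layers * kv_heads * head_dim * bytes_per_value * tokens


# Placeholder architecture values (assumptions for illustration only).
ATTENTION_LAYERS = 10
KV_HEADS = 2
HEAD_DIM = 256
CONTEXT = 262_144

fp8_bytes = kv_cache_bytes(CONTEXT, ATTENTION_LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)
bf16_bytes = kv_cache_bytes(CONTEXT, ATTENTION_LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)

print(f"FP8 KV cache:  {fp8_bytes / 2**30:.2f} GiB")
print(f"BF16 KV cache: {bf16_bytes / 2**30:.2f} GiB")
```

Whatever the real architecture values are, an 8-bit cache needs half the bytes of a 16-bit one for the same number of tokens.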

Also Available

| Model | Speed | Link |
|---|---|---|
| Qwen3.6-35B-A3B (base) | 182 tok/s | sakamakismile/Qwen3.6-35B-A3B-NVFP4 |
| Huihui abliterated | 175 tok/s | sakamakismile/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4 |
| Claude 4.6 Opus abliterated | 175 tok/s | sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.6-Opus-abliterated-NVFP4 |

Quantization Details

Recipe

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)
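
The `ignore` list decides which modules stay unquantized. The sketch below shows how its entries behave, assuming that entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries are exact names. The module names in the usage example are hypothetical, and `is_ignored` is an illustrative helper, not the library's own function.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if a module name is excluded from quantization by the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Strip the "re:" prefix and match the pattern from the start of the name.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in (
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
):
    print(f"{name}: {'kept in original precision' if is_ignored(name) else 'quantized'}")
```

The trailing `$` matters: it keeps the router (`mlp.gate`) unquantized without also excluding the experts' `gate_proj` layers, which make up most of the weights.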

Calibration

Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
Samples: 512
Max sequence length: 2048
moe_calibrate_all_experts=True — ensures all 256 experts receive calibration data

Reproduction

from transformers import Qwen3_5MoeForConditionalGeneration, AutoProcessor, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated"

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

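# Take the first 512 conversations as the calibration set, then shuffle their order.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.shuffle(seed=42)

def preprocess(example):
    # Render each conversation to plain text with the model's chat template.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    # The chat template already inserts special tokens, so none are added here.
    return tokenizer(sample["text"], padding=False, max_length=2048,
                     truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)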

oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=2048, num_calibration_samples=512,
        moe_calibrate_all_experts=True)

model.save_pretrained("output", save_compressed=True)
processor.save_pretrained("output")
tokenizer.save_pretrained("output")

Environment

| Package | Version |
|---|---|
| torch | 2.11.0+cu130 |
| transformers | 5.5.4 |
| llmcompressor | 0.1.dev (main) |
| compressed-tensors | 0.15.1a20260414 |
| CUDA | 13.0 |

Requirements

GPU: NVIDIA Blackwell (SM 120) — RTX 5090, 5080, 5070 Ti, RTX PRO 6000
VRAM: ~22 GB minimum (text only), 96 GB for 256K context
Software: vLLM nightly (cu130)
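
These requirements can be checked before downloading the weights. The sketch below encodes the stated compute capability (SM 120, that is (12, 0)) and the approximate 22 GB text-only minimum as constants, and optionally probes a local GPU through PyTorch if it is installed. The function names are hypothetical, and the thresholds simply restate the figures above.

```python
REQUIRED_CAPABILITY = (12, 0)  # SM 120, as stated in the requirements
MIN_VRAM_GB = 22.0             # approximate text-only minimum from the card


def meets_requirements(capability, total_vram_gb):
    """Check a (major, minor) compute capability and VRAM size against the card's requirements."""
    return tuple(capability) == REQUIRED_CAPABILITY and total_vram_gb >= MIN_VRAM_GB


def probe_local_gpu(index=0):
    """Return ((major, minor), VRAM in GB) for a local GPU, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(index)
    return (props.major, props.minor), props.total_memory / 1e9


# Usage:
# found = probe_local_gpu()
# if found is not None:
#     print("OK" if meets_requirements(*found) else "GPU does not meet the stated requirements")
```

Note that passing the capability check is not enough on its own: a card with less than about 22 GB of memory cannot hold the weights on a single GPU.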

Notes

Abliterated (uncensored). Use responsibly.
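Multimodal (vision) is fully functional; the vision modules are kept at BF16 precision.
Gated DeltaNet + Attention hybrid architecture.
NVFP4 is Blackwell-specific and will not work on Ampere or Hopper GPUs.
Use --kv-cache-dtype fp8 for roughly 2x KV cache capacity compared with a 16-bit cache; the benchmarks above were run with the FP8 KV cache and all passed.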

Credits

Abliteration + Claude distillation: huihui-ai
Original model: Qwen
Quantization tool: llm-compressor
SM120 NVFP4 kernels: blackwell-geforce-nvfp4-gemm
Quantized by: Lna-Lab

Support the Base Model Author

Ko-fi: https://ko-fi.com/huihuiai
Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge

Model tree for sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4

Base model: Qwen/Qwen3.6-35B-A3B
Adapter: lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
Adapter: huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated
Quantized (9): this model