Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4

NVFP4 quantized version of huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated — Claude 4.7 Opus distilled, abliterated (uncensored) Qwen 3.6 MoE with 256 experts and 3B active parameters.

67 GB → 21.9 GB. Single NVIDIA Blackwell GPU. 182 tok/s. 256K context. VLM. Uncensored.

Why This Model

Claude 4.7 Opus intelligence distilled into a locally runnable MoE, with abliteration for unrestricted research use:

  • 256 experts, 3B active — extreme sparsity = extreme speed
  • Claude 4.7 Opus distillation — latest Opus reasoning quality
  • 262K native context — fits on single 96 GB GPU with FP8 KV
  • VLM — vision fully functional (BF16 precision)
  • Abliterated — no refusals, full capability for research and local deployment

Key Specs

Base model huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated
Architecture Qwen3.5 MoE — 35B total, 3B active, 256 experts (8 routed + 1 shared)
Quantization NVFP4 W4A4 (weights FP4, activations FP4, scales FP8)
Format compressed-tensors (native vLLM support)
Tool vllm-project/llm-compressor (main)
Calibration 512 samples, ultrachat_200k, seq_len=2048, moe_calibrate_all_experts=True
Size 21.9 GB
Max context 262,144 tokens (native)
Requires NVIDIA Blackwell GPU (SM 120), vLLM nightly (cu130)

Quickstart

vLLM

vllm serve sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4 \
    --max-model-len 32768 \
    --reasoning-parser qwen3 \
    --kv-cache-dtype fp8

With tool calling (agentic)

vllm serve sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4 \
    --max-model-len 32768 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype fp8

Docker

docker run --gpus device=0 -p 8090:8090 \
    -v /path/to/model:/models/current:ro \
    --shm-size 16gb \
    vllm/vllm-openai:cu130-nightly \
    vllm serve /models/current --port 8090 --max-model-len 32768 \
    --reasoning-parser qwen3 --kv-cache-dtype fp8

Benchmark

Single NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), 256K context, FP8 KV cache.

Test Tokens Speed Result
English (CAP theorem) 256 182 tok/s PASS
Japanese (量子コンピュータ) 256 182 tok/s PASS
Code (async scheduler) 512 182 tok/s PASS
Math (Bayes theorem) 512 182 tok/s PASS
Burst stability (×3) 512 182-188 tok/s PASS — stable
VLM (shape recognition) 256 PASS ✅

Sustained: ~182 tok/s (single GPU, 256K context).

VRAM Usage

Context Length VRAM KV Cache
262,144 (256K) 95.6 GB FP8

256K context fits on a single 96 GB Blackwell GPU with FP8 KV cache.

Also Available

Model Speed Link
Qwen3.6-35B-A3B (base) 182 tok/s sakamakismile/Qwen3.6-35B-A3B-NVFP4
Huihui abliterated 175 tok/s sakamakismile/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4
Claude 4.6 Opus abliterated 175 tok/s sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.6-Opus-abliterated-NVFP4

Quantization Details

Recipe

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

Calibration

  • Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
  • Samples: 512
  • Max sequence length: 2048
  • moe_calibrate_all_experts=True — ensures all 256 experts receive calibration data

Reproduction

from transformers import Qwen3_5MoeForConditionalGeneration, AutoProcessor, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated"

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=2048,
                     truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)

oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=2048, num_calibration_samples=512,
        moe_calibrate_all_experts=True)

model.save_pretrained("output", save_compressed=True)
processor.save_pretrained("output")
tokenizer.save_pretrained("output")

Environment

Package Version
torch 2.11.0+cu130
transformers 5.5.4
llmcompressor 0.1.dev (main)
compressed-tensors 0.15.1a20260414
CUDA 13.0

Requirements

  • GPU: NVIDIA Blackwell (SM 120) — RTX 5090, 5080, 5070 Ti, RTX PRO 6000
  • VRAM: ~22 GB minimum (text only), 96 GB for 256K context
  • Software: vLLM nightly (cu130)

Notes

  • Abliterated (uncensored). Use responsibly.
  • Multimodal (vision) fully functional at BF16 precision.
  • Gated DeltaNet + Attention hybrid architecture.
  • NVFP4 is Blackwell-specific. Will not work on Ampere/Hopper.
  • Use --kv-cache-dtype fp8 for 2x KV capacity at no quality cost.

Credits

Support the Base Model Author

Downloads last month
14,138
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sakamakismile/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-NVFP4