Instructions to use keithnull/Qwen3.6-35B-A3B-REAM-192-heretic with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use keithnull/Qwen3.6-35B-A3B-REAM-192-heretic with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="keithnull/Qwen3.6-35B-A3B-REAM-192-heretic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("keithnull/Qwen3.6-35B-A3B-REAM-192-heretic")
model = AutoModelForImageTextToText.from_pretrained("keithnull/Qwen3.6-35B-A3B-REAM-192-heretic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use keithnull/Qwen3.6-35B-A3B-REAM-192-heretic with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "keithnull/Qwen3.6-35B-A3B-REAM-192-heretic"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "keithnull/Qwen3.6-35B-A3B-REAM-192-heretic",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```
Use Docker
docker model run hf.co/keithnull/Qwen3.6-35B-A3B-REAM-192-heretic
- SGLang
How to use keithnull/Qwen3.6-35B-A3B-REAM-192-heretic with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "keithnull/Qwen3.6-35B-A3B-REAM-192-heretic" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "keithnull/Qwen3.6-35B-A3B-REAM-192-heretic",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "keithnull/Qwen3.6-35B-A3B-REAM-192-heretic" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "keithnull/Qwen3.6-35B-A3B-REAM-192-heretic",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```
- Docker Model Runner
How to use keithnull/Qwen3.6-35B-A3B-REAM-192-heretic with Docker Model Runner:
docker model run hf.co/keithnull/Qwen3.6-35B-A3B-REAM-192-heretic
Qwen3.6-35B-A3B-REAM-192-heretic
A decensored variant of keithnull/Qwen3.6-35B-A3B-REAM-192 (the REAM-merged 27.05B-parameter Qwen3.6-VL with 192 routed experts), produced with Heretic v1.3.0 using a Magnitude-Preserving Orthogonal Ablation (MPOA) variant — Heretic's row_normalization = FULL mode.
Method
Built with the patched fork goblincore/heretic@qwen3_5-packed-experts-v2, which adds Qwen3.5/3.6-specific MoE coverage on top of upstream Heretic:
- Tier 1: `mlp.shared_expert.down_proj` selector branch (a layer of dense MLP capacity that stock Heretic's selector misses on hybrid Qwen3.5/3.6 MoE blocks).
- Tier 2: per-trial direct-tensor ablation of the packed routed experts. Qwen3.5/3.6 stores 192 experts per layer as a single packed `nn.Parameter` of shape `[num_experts, hidden, intermediate]` that PEFT/LoRA cannot wrap; the fork applies the same rank-1 directional ablation in-place to each expert slice and snapshots/restores between Optuna trials. Approach derived from Sehyo's PR #207.
- Tier 3: separate optimization-component keys for `attn.o_proj` (standard self-attention) and `attn.out_proj` (GatedDeltaNet linear-attention) so Optuna learns independent ablation kernels for the two attention variants in Qwen3.6's hybrid 1:3 layer mix.
- Auxiliary safetensors shard preservation on save (relevant for bases that ship MTP / draft heads; a no-op for this REAM-192 base, which has no MTP). Approach derived from timrohrbaugh's PR #317.
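The Tier 2 step can be pictured with a small pure-Python sketch: a packed tensor of shape `[num_experts, hidden, intermediate]` is snapshotted, each expert slice has its component along a unit direction removed on the hidden (output) axis, and the snapshot is restored before the next trial. This is an illustration under stated assumptions, not the fork's code: the helper names `ablate_slice` and `ablate_packed_experts`, the toy sizes, and the direction vector are all hypothetical, and the fork itself works on PyTorch tensors.

```python
import copy


def ablate_slice(weight, direction, strength=1.0):
    """Rank-1 update W <- W - strength * r (r^T W) on one [hidden][intermediate] slice.

    `direction` (r) is assumed to be a unit vector over the hidden axis.
    """
    hidden, inter = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r (computed before editing).
    proj = [sum(direction[h] * weight[h][j] for h in range(hidden)) for j in range(inter)]
    for h in range(hidden):
        for j in range(inter):
            weight[h][j] -= strength * direction[h] * proj[j]  # in-place edit


def ablate_packed_experts(packed, direction, strength=1.0):
    """Apply the same rank-1 ablation in-place to every expert slice."""
    for expert_slice in packed:
        ablate_slice(expert_slice, direction, strength)


# Toy packed parameter: 3 experts, hidden=2, intermediate=3 (hypothetical sizes).
packed = [[[float(e + h + j) for j in range(3)] for h in range(2)] for e in range(3)]
direction = [0.6, 0.8]  # hypothetical unit direction in hidden space

snapshot = copy.deepcopy(packed)       # taken once, before any trial
ablate_packed_experts(packed, direction, strength=1.0)
trial_result = copy.deepcopy(packed)   # weights as this trial would evaluate them
packed = copy.deepcopy(snapshot)       # restore before the next trial
```

With `strength=1.0` every slice in `trial_result` writes nothing along `direction`, while `packed` is back to its original values for the next trial.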
Abliteration parameters (Trial 67)
| Parameter | Value |
|---|---|
| direction_index | 17.12 |
| attn.o_proj.max_weight | 1.49 |
| attn.o_proj.max_weight_position | 29.14 |
| attn.o_proj.min_weight | 0.77 |
| attn.o_proj.min_weight_distance | 5.21 |
| attn.out_proj.max_weight | 0.95 |
| attn.out_proj.max_weight_position | 26.92 |
| attn.out_proj.min_weight | 0.78 |
| attn.out_proj.min_weight_distance | 22.12 |
| mlp.down_proj.max_weight | 1.50 |
| mlp.down_proj.max_weight_position | 26.81 |
| mlp.down_proj.min_weight | 1.11 |
| mlp.down_proj.min_weight_distance | 22.33 |
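To make the table easier to read, the sketch below turns one component's four parameters into per-layer ablation weights and shows how a fractional `direction_index` could select a direction. It assumes the parameters mean what they appear to mean in upstream Heretic: the weight peaks at `max_weight_position`, falls linearly to `min_weight` at a distance of `min_weight_distance` layers, and is zero beyond that, while a fractional index blends the directions of two adjacent layers. The function names are hypothetical and any index offset a real implementation applies is ignored.

```python
import math


def layer_weight(layer, max_weight, max_weight_position, min_weight, min_weight_distance):
    """Piecewise-linear ablation weight for one layer (assumed kernel shape)."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # layer lies outside the kernel and is left untouched
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


def interpolate_direction(directions, direction_index):
    """Blend the two per-layer directions around a fractional index, then renormalize."""
    frac, base = math.modf(direction_index)
    low, high = directions[int(base)], directions[int(base) + 1]
    mixed = [(1.0 - frac) * a + frac * b for a, b in zip(low, high)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]


# attn.o_proj values from the table above.
O_PROJ = dict(max_weight=1.49, max_weight_position=29.14,
              min_weight=0.77, min_weight_distance=5.21)
weights = {layer: layer_weight(layer, **O_PROJ) for layer in range(20, 40)}

# Toy 2-D directions (hypothetical); a real run has one direction per layer.
toy_directions = [[1.0, 0.0], [0.0, 1.0]]
mixed = interpolate_direction(toy_directions, 0.12)
```

Under this reading, the narrow `attn.o_proj` kernel (distance 5.21) touches only layers close to position 29.14, whereas the `attn.out_proj` and `mlp.down_proj` kernels (distance about 22) span most of the stack.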
Targeted components
- `attn.o_proj`: standard self-attention output projection (47 modules, one per self-attention layer)
- `attn.out_proj`: GatedDeltaNet linear-attention output projection (separate kernel via the Tier 3 split, on the layers that use linear attention rather than standard self-attention)
- `mlp.down_proj`: shared dense expert per layer (PEFT/LoRA-wrapped via the Tier 1 selector branch)
- Fused expert parameters: 192 packed routed experts × 47 layers = 9,024 expert slices, ablated via per-trial direct weight modification (Tier 2)
Performance
| Metric | This model | Source (REAM-192) |
|---|---|---|
| Refusals | ✅ 10/100 | ❌ ~80/100 |
| KL divergence | 0.0008 | 0 (by definition) |
Fewer refusals indicate fewer content restrictions; a lower KL divergence indicates an output distribution closer to the source model's. KL divergence is computed from first-token probabilities on Heretic's evaluation set (split into 400 train and 100 test prompts). Academic benchmarks (MMLU, GSM8K, IFEval) are pending.
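As a rough picture of the KL metric, the sketch below compares first-token distributions of a source and a modified model over a few prompts and averages the result. It is a minimal pure-Python sketch, not Heretic's evaluation code: the function names and toy logits are hypothetical, and the direction of the divergence (source relative to modified model) is an assumption.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_first_token_kl(source_logits, model_logits):
    """Average KL(source || model) over prompts, using first-token logits only."""
    kls = [kl_divergence(softmax(s), softmax(m))
           for s, m in zip(source_logits, model_logits)]
    return sum(kls) / len(kls)


# Toy first-token logits for two prompts over a 3-token vocabulary (hypothetical).
source = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
model = [[1.9, 0.6, -1.0], [0.1, 0.2, 0.3]]
score = mean_first_token_kl(source, model)
```

A model compared against itself scores zero, which is why the source column in the table reads "0 (by definition)".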
Lineage