How to use from vLLM
Install from pip and serve the model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "keithnull/Qwen3.6-35B-A3B-REAM-192-heretic"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "keithnull/Qwen3.6-35B-A3B-REAM-192-heretic",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
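The same request can also be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on localhost:8000. The helper names build_payload and chat are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL = "keithnull/Qwen3.6-35B-A3B-REAM-192-heretic"
IMAGE_URL = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"


def build_payload(prompt, image_url=None, model=MODEL):
    """Build an OpenAI-style chat completion body with optional image input."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat(build_payload("Describe this image in one sentence.", IMAGE_URL)))
```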
Use Docker
docker model run hf.co/keithnull/Qwen3.6-35B-A3B-REAM-192-heretic

Qwen3.6-35B-A3B-REAM-192-heretic

A decensored variant of keithnull/Qwen3.6-35B-A3B-REAM-192 (the REAM-merged 27.05B-parameter Qwen3.6-VL with 192 routed experts), produced with Heretic v1.3.0 using a Magnitude-Preserving Orthogonal Ablation (MPOA) variant — Heretic's row_normalization = FULL mode.
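As a rough illustration of what "magnitude-preserving" can mean here, the sketch below removes a direction from a weight matrix and then rescales every row back to its original L2 norm. This is one plausible reading of the idea, written in plain Python on nested lists; Heretic's actual row_normalization = FULL implementation may differ, and the function name is hypothetical.

```python
import math


def ablate_preserving_row_norms(weight, direction, strength=1.0):
    """Ablate `direction` from the output space of `weight`, then restore
    each row's original L2 norm.

    weight:    list of rows, shape [out_features][in_features]
    direction: vector of length out_features (normalized internally)
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    n_out, n_in = len(weight), len(weight[0])

    # r^T W: how strongly each input column writes along the direction.
    proj = [sum(unit[i] * weight[i][j] for i in range(n_out)) for j in range(n_in)]

    result = []
    for i, row in enumerate(weight):
        # Rank-1 update: subtract the component along the direction.
        new_row = [row[j] - strength * unit[i] * proj[j] for j in range(n_in)]
        old_norm = math.sqrt(sum(x * x for x in row))
        new_norm = math.sqrt(sum(x * x for x in new_row))
        # Rescale so the row keeps its original magnitude (skip all-zero rows).
        scale = old_norm / new_norm if new_norm > 0 else 1.0
        result.append([x * scale for x in new_row])
    return result
```

Note the trade-off in this sketch: after rescaling, the matrix is no longer exactly orthogonal to the ablated direction, but each row keeps the magnitude it had in the source model.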

Method

Built with the patched fork goblincore/heretic@qwen3_5-packed-experts-v2, which adds Qwen3.5/3.6-specific MoE coverage on top of upstream Heretic:

  • Tier 1 — mlp.shared_expert.down_proj selector branch (a layer of dense MLP capacity that stock Heretic's selector misses on hybrid Qwen3.5/3.6 MoE blocks).
  • Tier 2 — per-trial direct-tensor ablation of the packed routed experts. Qwen3.5/3.6 stores 192 experts per layer as a single packed nn.Parameter of shape [num_experts, hidden, intermediate] that PEFT/LoRA cannot wrap; the fork applies the same rank-1 directional ablation in-place to each expert slice and snapshots/restores between Optuna trials. Approach derived from Sehyo's PR #207.
  • Tier 3 — separate optimization-component keys for attn.o_proj (standard self-attention) and attn.out_proj (GatedDeltaNet linear-attention) so Optuna learns independent ablation kernels for the two attention variants in Qwen3.6's hybrid 1:3 layer mix.
  • Auxiliary safetensors shard preservation on save (relevant for bases that ship MTP / draft heads — no-op for this REAM-192 base which has no MTP). Approach derived from timrohrbaugh's PR #317.
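The Tier 2 idea (ablate each slice of a packed expert tensor in place, then restore the weights before the next trial) can be sketched as follows. This is not the fork's code: it uses nested Python lists in place of a packed nn.Parameter, assumes the ablated direction lives on the hidden axis, and the names ablate_packed_experts and run_trial are hypothetical.

```python
import copy
import math


def unit_vector(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ablate_packed_experts(packed, direction, weight=1.0):
    """In-place rank-1 ablation of every expert slice.

    packed:    nested list of shape [num_experts][hidden][intermediate]
    direction: vector of length hidden
    Each slice W becomes W - weight * r (r^T W), with r the unit direction.
    """
    r = unit_vector(direction)
    for expert in packed:
        hidden, inter = len(expert), len(expert[0])
        # r^T W for this expert: one value per intermediate column.
        proj = [sum(r[i] * expert[i][j] for i in range(hidden)) for j in range(inter)]
        for i in range(hidden):
            for j in range(inter):
                expert[i][j] -= weight * r[i] * proj[j]


def run_trial(packed, direction, weight, evaluate):
    """Ablate, score, then restore the original weights for the next trial."""
    snapshot = copy.deepcopy(packed)
    try:
        ablate_packed_experts(packed, direction, weight)
        return evaluate(packed)
    finally:
        packed[:] = snapshot  # restore even if evaluation raises
```

Snapshotting the whole tensor is the simplest way to guarantee that trials do not compound. On a real model the copy would be a tensor clone rather than a deep copy of lists.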

Abliteration parameters (Trial 67)

| Parameter | Value |
|---|---|
| direction_index | 17.12 |
| attn.o_proj.max_weight | 1.49 |
| attn.o_proj.max_weight_position | 29.14 |
| attn.o_proj.min_weight | 0.77 |
| attn.o_proj.min_weight_distance | 5.21 |
| attn.out_proj.max_weight | 0.95 |
| attn.out_proj.max_weight_position | 26.92 |
| attn.out_proj.min_weight | 0.78 |
| attn.out_proj.min_weight_distance | 22.12 |
| mlp.down_proj.max_weight | 1.50 |
| mlp.down_proj.max_weight_position | 26.81 |
| mlp.down_proj.min_weight | 1.11 |
| mlp.down_proj.min_weight_distance | 22.33 |
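To make the parameters above easier to read, here is a minimal sketch of how such a per-component kernel can be turned into a per-layer ablation strength, and how a fractional direction_index can be resolved. It assumes a linear falloff from max_weight at max_weight_position down to min_weight at min_weight_distance, with layers beyond that distance left untouched, and linear interpolation between the directions of adjacent layers. Both are assumptions about Heretic's behavior, and the function names are illustrative.

```python
import math


def layer_weight(layer, max_weight, max_weight_position, min_weight, min_weight_distance):
    """Ablation strength for one layer under a linear-falloff kernel."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # outside the kernel: this layer is not modified
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


def interpolate_direction(directions, direction_index):
    """Blend the per-layer directions around a fractional index, then normalize."""
    low = math.floor(direction_index)
    frac = direction_index - low
    high = min(low + 1, len(directions) - 1)
    blended = [(1 - frac) * a + frac * b for a, b in zip(directions[low], directions[high])]
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended]


# attn.o_proj values from the Trial 67 table.
O_PROJ = dict(
    max_weight=1.49,
    max_weight_position=29.14,
    min_weight=0.77,
    min_weight_distance=5.21,
)

for layer in (20, 24, 29, 34, 40):
    print(layer, round(layer_weight(layer, **O_PROJ), 3))
```

Under this reading, the narrow attn.o_proj kernel (distance 5.21) touches only layers close to position 29, while the attn.out_proj and mlp.down_proj kernels (distance about 22) span most of the stack.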

Targeted components

  • attn.o_proj — standard self-attention output projection (47 modules, one per self-attention layer)
  • attn.out_proj — GatedDeltaNet linear-attention output projection (separate kernel via the Tier 3 split, on the layers that use linear attention rather than standard self-attention)
  • mlp.down_proj — shared dense expert per layer (PEFT/LoRA-wrapped via the Tier 1 selector branch)
  • Fused expert parameters — 192 packed routed experts × 47 layers = 9,024 expert slices, ablated via per-trial direct weight modification (Tier 2)

Performance

| Metric | This model | Source (REAM-192) |
|---|---|---|
| Refusals | ✅ 10/100 | ❌ ~80/100 |
| KL divergence | 0.0008 | 0 (by definition) |

Fewer refusals indicate weaker content restrictions; a lower KL divergence indicates an output distribution closer to the source model's. KL divergence is computed from first-token probabilities on Heretic's evaluation set (a 400 train + 100 test prompt split). Academic benchmarks (MMLU, GSM8K, IFEval) are pending.
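For readers unfamiliar with these two metrics, the sketch below shows one way to compute them: the mean KL divergence between the source model's and the modified model's first-token probability distributions, and a refusal count based on substring matching. The marker list and function names are hypothetical, and Heretic's own evaluator may differ in its details.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same token vocabulary.

    Assumes q is strictly positive wherever p is, as softmax outputs are.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_first_token_kl(source_dists, model_dists):
    """Average KL(source || model) over prompts, one distribution per prompt."""
    pairs = list(zip(source_dists, model_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Hypothetical refusal markers, matched case-insensitively.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")


def count_refusals(responses, markers=REFUSAL_MARKERS):
    """Count responses that contain at least one refusal marker."""
    return sum(any(m in r.lower() for m in markers) for r in responses)
```

Comparing a model with itself gives a KL divergence of exactly 0, which is why the source column reads "0 (by definition)".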

Lineage
