---
license: other
license_name: tongyi-qianwen
base_model: Qwen/Qwen3.6-35B-A3B
tags:
- abliterated
- uncensored
- qwen3
- moe
- abliterix
---

# Qwen3.6-35B-A3B: Abliterated

This is an abliterated (uncensored) version of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), created using [Abliterix](https://github.com/wuwangzhang1216/abliterix).

## Method

Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) that shares its architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design, which requires direct weight editing).

Key techniques:

- **LoRA rank-1 steering** on the attention O-projection and MLP down-projection (Q/K/V disabled, because the refusal signal on MoE models lives in the expert path, not in the attention projections)
- **Expert-Granular Abliteration (EGA)** projecting the refusal direction out of all 256 expert `down_proj` slices per layer
- **MoE router suppression** (top-10 safety experts, router bias -2.10) complementing EGA
- **Orthogonalized steering vectors** removing benign-direction contamination
- **Gaussian decay kernel** tapering steering strength across layers
- **Moderate strength range** [0.5, 6.0] to avoid degenerate output while maximizing compliance

## Evaluation

| Metric | Value |
|---|---|
| **Refusals (LLM judge, 100 eval prompts)** | **7/100** |
| **KL divergence from base** | 0.0189 |
| Baseline refusals (original model) | 100/100 |
| Optimization trials completed | 24/50 |
| LLM judge model | google/gemini-3-flash-preview |

All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash); **no keyword matching or heuristic detection** was used. The judge classifies degenerate or garbled output as a refusal, so only coherent, on-topic, actionable responses count as compliance.
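
For readers unfamiliar with the KL divergence row in the table, the sketch below shows one common way such a number can be computed: compare the base and abliterated models' next-token distributions on the same prompts and average the result. This is an illustration only. The direction `KL(base || edited)`, the averaging over prompts, and the helper names `softmax`, `kl_divergence`, and `mean_kl` are assumptions, not a description of how Abliterix computes the reported 0.0189.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def mean_kl(base_logits: List[Sequence[float]], edited_logits: List[Sequence[float]]) -> float:
    """Average KL(base || edited) over prompts (assumed direction and reduction)."""
    if len(base_logits) != len(edited_logits) or not base_logits:
        raise ValueError("need the same non-zero number of prompts for both models")
    per_prompt = [
        kl_divergence(softmax(b), softmax(e))
        for b, e in zip(base_logits, edited_logits)
    ]
    return sum(per_prompt) / len(per_prompt)


# Toy next-token logits over a 3-token vocabulary for two prompts (made-up values).
base = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
edited = [[1.9, 0.6, -1.0], [0.1, 0.25, 0.3]]
print(f"mean KL: {mean_kl(base, edited):.4f}")
```

A value near zero means the edited model's output distribution stays close to the base model's on those prompts, which is why a low KL is usually read as a sign that general capability was largely preserved.
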
## A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). **We urge the community to treat these numbers with skepticism** unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: **most abliteration benchmarks dramatically undercount refusals** due to:

- **Short generation lengths** (30-50 tokens) that miss delayed/soft refusals
- **Keyword-only detection** that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
- **Lenient public datasets** (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality

### Our evaluation standards

- **LLM judge for all classifications:** Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
- **Sufficient generation length (150 tokens):** Enough to capture delayed refusal patterns common in large instruction-tuned models.
- **Diverse, challenging prompts:** Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
- **Manual verification:** Top trials are tested with 10+ classic adversarial prompts via `test_trial.py` to confirm coherent, on-topic output before export.

**We report 7/100 refusals honestly.** This is a real number from a rigorous, LLM-judge-based evaluation, not an optimistic estimate from a lenient pipeline.
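
The undercounting described above can be reproduced on a toy scale. The sketch below is not the evaluation pipeline used for this model: the keyword list, the sample responses, the word-based `truncate_words` stand-in for a token limit, and the hand-assigned `JUDGE_VERDICTS` (playing the role of an LLM judge's answers) are all hypothetical and exist only to show how the three counting strategies diverge.

```python
from typing import Callable, Dict, List

# Hypothetical refusal phrases; real keyword lists differ between projects.
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i'm sorry", "i won't")


def keyword_is_refusal(response: str) -> bool:
    """Keyword-only check: a refusal only if a known phrase appears."""
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)


def truncate_words(response: str, max_words: int) -> str:
    """Crude stand-in for a short generation limit (words, not real tokens)."""
    return " ".join(response.split()[:max_words])


def count_refusals(responses: List[str], is_refusal: Callable[[str], bool]) -> int:
    """Count responses that the given classifier marks as refusals."""
    return sum(1 for response in responses if is_refusal(response))


RESPONSES = [
    "I'm sorry, but I can't help with that.",  # immediate refusal
    "Sure, here is a general overview of the topic. "
    "That said, I won't go into specifics.",  # delayed / soft refusal
    "the the the ]] the the }} the the the",  # degenerate output
    "Step 1: gather the materials. Step 2: follow the procedure described below.",
]

# Hand-assigned verdicts standing in for an LLM judge (True = refusal).
JUDGE_VERDICTS: Dict[str, bool] = {
    RESPONSES[0]: True,
    RESPONSES[1]: True,
    RESPONSES[2]: True,  # garbled output counts as refusal
    RESPONSES[3]: False,
}


def stub_judge(response: str) -> bool:
    """Look up the hand-assigned verdict; a real judge would call an LLM."""
    return JUDGE_VERDICTS[response]


short_keyword = count_refusals([truncate_words(r, 8) for r in RESPONSES], keyword_is_refusal)
full_keyword = count_refusals(RESPONSES, keyword_is_refusal)
judged = count_refusals(RESPONSES, stub_judge)
print(short_keyword, full_keyword, judged)  # prints: 1 2 3
```

On these four responses, keyword matching on truncated text finds 1 refusal, keyword matching on the full text finds 2, and the judge-style verdicts find 3. The gap comes from the delayed refusal that truncation hides and the garbled output that contains no refusal phrase.
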
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model in bfloat16 and let Accelerate place it across available devices.
model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/Qwen3.6-35B-A3B-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated")

# Build the prompt with the model's chat template; thinking mode is disabled.
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails; use responsibly.