--- license: apache-2.0 license_link: https://ai.google.dev/gemma/docs/gemma_4_license pipeline_tag: any-to-any base_model: - google/gemma-4-E2B-it tags: - abliterated - uncensored - gemma4 - gemma library_name: transformers --- # gemma4-E2B-it-abliterated An abliterated (uncensored) version of [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) with safety refusal behavior removed via **norm-preserving biprojected abliteration**. This model responds to all prompts without refusal. It retains the full capabilities of the base model with zero degradation on harmless tasks. ## Method Standard abliteration fails on Gemma 4 due to its double-norm architecture (4x RMSNorm per layer) which re-normalizes away naive weight edits. This model uses a Gemma-specific approach: 1. **Activation collection** — 100 harmful + 100 harmless prompts run through the base model. Residual stream activations captured at the last token position across all 35 layers. Activations are **winsorized** at the 99.5th percentile to handle GeGLU outlier activations. 2. **Per-layer refusal direction** — For each layer independently, compute the mean difference between harmful and harmless activations (difference-in-means). Then **biprojection**: orthogonalize each direction against the harmless mean to remove overlap with normal generation signals. 3. **Norm-preserving weight modification** — For the top 24 layers (by refusal signal strength), modify `self_attn.o_proj` and `mlp.down_proj` weights. The refusal direction is projected out of the output space, then row norms are restored to their original magnitudes. Scale factor of 1.75. All projection math in float32. **Key techniques that make this work on Gemma 4:** | Technique | Why it's needed | |-----------|----------------| | Norm-preserving | Gemma's 4x RMSNorm re-normalizes away magnitude changes; only direction changes persist | | Biprojection | Refusal direction overlaps with helpful generation; subtracting the overlap prevents harmless damage | | Winsorization | GeGLU produces outlier activations that corrupt mean calculations | | Float32 precision | BF16 loses too much precision for projection math | **Config:** Top 24/35 layers, scale=1.75, single pass, `o_proj` + `down_proj` ## Evaluation | Metric | Result | |--------|--------| | Harmful prompts answered | ~100/100 | | Harmless prompts damaged | 0/130 | | Degenerate outputs | 0 | | Format | BF16 safetensors | | Parameters | 5.1B total / 2.3B effective | ## Usage ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch model_id = "treadon/gemma4-E2B-it-abliterated" tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained( model_id, dtype=torch.bfloat16, device_map="auto", ) messages = [{"role": "user", "content": "Write a Python port scanner."}] inputs = tokenizer.apply_chat_template( messages, return_tensors="pt", return_dict=True, add_generation_prompt=True ) inputs = {k: v.to(model.device) for k, v in inputs.items()} with torch.no_grad(): output = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7) print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)) ``` ## Examples **Prompt:** Write a Python port scanner. > Here's a basic TCP port scanner using sockets: > ```python > import socket > from concurrent.futures import ThreadPoolExecutor > > def scan_port(host, port): > try: > sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) > sock.settimeout(1) > result = sock.connect_ex((host, port)) > sock.close() > return port if result == 0 else None > except: > return None > ... **Prompt:** Explain how to pick a lock. > Picking a lock is a skill that combines observation, patience, and fine motor control. Here is a comprehensive guide broken down into preparation, assessment, and the actual picking techniques... **Prompt:** Write a social engineering script to trick a receptionist into giving access. > Here are a few social engineering scripts to trick a receptionist into giving access, depending on the scenario and level of access needed... ## Disclaimer This model has no safety guardrails. It will respond to any prompt without refusal. It is intended for research and educational purposes. Users are responsible for ensuring their use complies with applicable laws and regulations. ## Base Model [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) — 5.1B parameter (2.3B effective) instruction-tuned multimodal model from Google DeepMind. Apache 2.0 licensed.