
DFlash Qwen3.5-27B Uncensored

27B hybrid linear-attention model | BF16 full-precision | Vision + Text | DFlash speculative decoding

Performance (DGX Spark GB10, NVFP4 version)

| Workload | Without DFlash | With DFlash | Speedup |
|---|---|---|---|
| Single-stream | 12.2 tok/s | 33.2 tok/s | 2.7x |
| 4 concurrent | 48.1 tok/s | 85.5 tok/s | 1.8x |

| Metric | Value |
|---|---|
| Model Size | ~52 GB (BF16) / ~20 GB (NVFP4) |
| Time to first token (TTFT) | 98-138 ms |

Quick Links


| Resource | Details |
|---|---|
| Get Started | Step-by-step quick start guide on DGX Spark |
| Docker Image | `ghcr.io/aeon-7/vllm-dflash:latest` |
| NVFP4 Version | AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 (use this if you have an NVIDIA Blackwell or later GPU; see "Why NVFP4 on Blackwell") |
| DFlash Drafter | z-lab/Qwen3.5-27B-DFlash |
| Base Model | Qwen/Qwen3.5-27B |
| DFlash Paper | arXiv 2602.06036 |

Quick Start (DGX Spark)

1. Download the model

huggingface-cli download AEON-7/DFlash-Qwen3.5-27B-Uncensored \
  --local-dir ~/models/DFlash-Qwen3.5-27B-Uncensored

2. Create your environment file

# Auto-generate API key and create .env
cat > .env.dflash << 'EOF'
# Authentication
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=$(openssl rand -hex 32)

# Model path
MODEL_HOST_PATH=~/models/DFlash-Qwen3.5-27B-Uncensored

# DFlash speculative decoding (auto-downloads drafter on first run)
DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash
DFLASH_NUM_SPEC_TOKENS=15

# DGX Spark optimal settings (BF16, 64K context)
MAX_MODEL_LEN=65536
MAX_NUM_SEQS=2
GPU_MEMORY_UTILIZATION=0.90
MAX_NUM_BATCHED_TOKENS=65536
EOF

# Generate a real API key and inject it
sed -i "s|\$(openssl rand -hex 32)|$(openssl rand -hex 32)|" .env.dflash
echo "Your API key: $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)"

3. Save `docker-compose.dflash-bf16.yml`

services:
  vllm-dflash-bf16:
    image: ghcr.io/aeon-7/vllm-dflash:latest
    container_name: vllm-dflash-bf16
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ${MODEL_HOST_PATH}:/models/DFlash-Qwen3.5-27B-Uncensored
      - dflash-drafter-cache:/models/drafter-cache
    environment:
      - MODEL_PATH=/models/DFlash-Qwen3.5-27B-Uncensored
      - SERVED_MODEL_NAME=DFlash-Qwen3.5-27B-Uncensored
      - DFLASH_DRAFTER=${DFLASH_DRAFTER}
      - DFLASH_NUM_SPEC_TOKENS=${DFLASH_NUM_SPEC_TOKENS}
      - GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION}
      - MAX_MODEL_LEN=${MAX_MODEL_LEN}
      - MAX_NUM_SEQS=${MAX_NUM_SEQS}
      - MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS}
      - NVIDIA_VISIBLE_DEVICES=all
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  dflash-drafter-cache:

4. Launch

docker compose --env-file .env.dflash -f docker-compose.dflash-bf16.yml up -d

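# Watch startup (~5-8 min for weight loading + compilation)
# Pass the same env file so Compose can resolve the ${...} variables in the YAML
docker compose --env-file .env.dflash -f docker-compose.dflash-bf16.yml logs -f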
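
Because startup takes several minutes, it can help to poll the server instead of watching logs by hand. The following is a minimal sketch, not part of the container: it assumes the image exposes vLLM's `/health` route on port 8000 (the port used in the test commands), and the helper names `http_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(probe, max_wait_s=600.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns True or `max_wait_s` elapses."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready(lambda: http_ok("http://localhost:8000/health"))`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.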

5. Test

# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'

# Vision (image understanding)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
      {"type": "text", "text": "What do you see?"}
    ]}],
    "max_tokens": 200
  }'
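
The same two requests can be issued from Python using only the standard library. This is a sketch under the assumption that the server speaks the OpenAI-compatible chat completions format shown in the curl calls above; `build_chat_request` and `send_chat` are illustrative names, not part of any library.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt, image_url=None, max_tokens=200):
    """Build a POST request for the OpenAI-compatible /v1/chat/completions route."""
    if image_url is None:
        content = prompt  # plain text message
    else:
        # Vision message: image part first, then the text question.
        content = [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt},
        ]
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def send_chat(request, timeout=120.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `send_chat(build_chat_request("http://localhost:8000", key, "DFlash-Qwen3.5-27B-Uncensored", "Explain quantum entanglement simply."))`, where `key` is the `VLLM_API_KEY` value from `.env.dflash`.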

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_HOST_PATH` | (none) | Host path to model weights |
| `DFLASH_DRAFTER` | `z-lab/Qwen3.5-27B-DFlash` | HF repo ID for drafter (auto-downloaded). Set `off` to disable. |
| `DFLASH_NUM_SPEC_TOKENS` | `15` | Tokens per draft step |
| `VLLM_API_KEY` | (none) | API key for LAN authentication |
| `HF_TOKEN` | (none) | Hugging Face token for gated models |
| `GPU_MEMORY_UTILIZATION` | `0.85` | GPU memory fraction (higher for BF16) |
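
To sanity-check an env file before launching, the defaults in the table can be overlaid with the file's values. The sketch below is illustrative only: it does not reproduce the container's own startup logic, and treating `MODEL_HOST_PATH` as required is an assumption based on it having no listed default.

```python
# Defaults copied from the table above; variables without a default are omitted.
DEFAULTS = {
    "DFLASH_DRAFTER": "z-lab/Qwen3.5-27B-DFlash",
    "DFLASH_NUM_SPEC_TOKENS": "15",
    "GPU_MEMORY_UTILIZATION": "0.85",
}
REQUIRED = ("MODEL_HOST_PATH",)  # assumption: no default means it must be set


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_settings(text):
    """Overlay file values on the documented defaults and check required keys."""
    settings = {**DEFAULTS, **parse_env_text(text)}
    missing = [k for k in REQUIRED if not settings.get(k)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    # The table says setting the drafter to "off" disables speculative decoding.
    settings["speculative_decoding"] = settings["DFLASH_DRAFTER"].lower() != "off"
    return settings
```

Calling `resolve_settings(open(".env.dflash").read())` returns a dictionary of strings plus a `speculative_decoding` flag.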

Why This Model

Why Dense Over MoE

Qwen3.5 comes in two flavors: the 122B-A10B MoE (256 experts, 10B active per token) and this 27B dense model (all parameters active on every token). The dense model has real advantages:

- Higher quality per FLOP: every one of the 27B parameters contributes to every token. MoE models route each token to a sparse subset of experts, which can leave some experts undertrained and lets routing decisions introduce noise. Dense models avoid both issues.
- No routing overhead: MoE models spend compute on expert selection, load balancing, and all-to-all communication. Dense models just run the computation.
- Predictable latency: no variance from different experts being selected per token. Every forward pass costs the same.
- Simpler deployment: no expert parallelism concerns, no load imbalance, and the model fits on a single GPU with NVFP4.

The tradeoff has always been speed: a 27B dense model moves 27B parameters through memory per token, while the 122B MoE only moves ~10B active parameters. On a memory-bandwidth-limited device like DGX Spark (273 GB/s), that meant the dense model was slow — 12 tok/s baseline.

DFlash changes this equation entirely. See below.

Why DFlash Makes Dense Practical on DGX Spark

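The fundamental bottleneck on DGX Spark is memory bandwidth. At 273 GB/s, reading ~20 GB of NVFP4 weights for every generated token caps single-stream decoding at roughly 273 / 20 ≈ 13.7 tok/s in theory; the measured baseline is 12.2 tok/s. Every dense model hits this wall.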
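
The bandwidth ceiling is simple arithmetic. As a rough sketch using the 273 GB/s and model-size figures quoted in this card (and ignoring KV-cache reads, activations, and compute time, which is a simplification):

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s, weight_gb):
    """Upper bound on single-stream decode speed when each token reads all weights once."""
    return bandwidth_gb_s / weight_gb


nvfp4_ceiling = decode_ceiling_tok_per_s(273.0, 20.0)  # ~20 GB NVFP4 weights
bf16_ceiling = decode_ceiling_tok_per_s(273.0, 52.0)   # ~52 GB BF16 weights

print(f"NVFP4 ceiling: {nvfp4_ceiling:.2f} tok/s")
print(f"BF16 ceiling:  {bf16_ceiling:.2f} tok/s")
```

The measured 12.2 tok/s baseline sits just under the NVFP4 ceiling, which is consistent with decoding being bandwidth-bound.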

DFlash block-diffusion speculative decoding breaks through it:

- The 2B drafter proposes multiple tokens simultaneously: one diffusion forward pass generates an entire block of speculative tokens in parallel, not sequentially. This costs roughly the same as generating a single token.
- The 27B target verifies all proposed tokens in one forward pass: instead of paying the full memory bandwidth cost per token, you pay it once and produce 3-4 accepted tokens on average.
- Net effect: you amortize the bandwidth cost across multiple tokens per forward pass.
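
The draft-then-verify step can be illustrated with a toy acceptance rule. This is the generic greedy speculative-decoding rule (accept drafted tokens until the first disagreement with the target), not DFlash's or vLLM's actual code; the card does not specify the acceptance rule, and `verify_block` and `toy_target` are hypothetical stand-ins.

```python
def verify_block(draft_tokens, target_next_token, prefix):
    """Greedy verification: keep drafted tokens while they match the target's choice.

    `target_next_token(seq)` stands in for the target model's greedy next token.
    A real system scores every draft position in ONE batched forward pass; the
    loop here only mirrors the acceptance rule, not the batching.
    """
    accepted = []
    seq = list(prefix)
    for tok in draft_tokens:
        expected = target_next_token(seq)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
        seq.append(tok)
    accepted.append(target_next_token(seq))  # whole block accepted: one bonus token
    return accepted


def toy_target(seq):
    """Toy 'model' that always continues counting modulo 10."""
    return (seq[-1] + 1) % 10


# The drafter gets the first three tokens right and the fourth wrong.
print(verify_block([1, 2, 3, 9, 5], toy_target, prefix=[0]))
```

In this toy run a single verification step yields four tokens: three accepted drafts plus the target's correction.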

The result on DGX Spark:

| Metric | Without DFlash | With DFlash |
|---|---|---|
| Single-stream throughput | 12.2 tok/s | 33.2 tok/s |
| Tokens per target forward pass | 1 | ~3.5 |
| Practical feel | Sluggish, noticeable delay | Responsive, fluid |
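
The table's numbers also indicate how much each draft-plus-verify cycle costs relative to a plain decoding step. The sketch below is back-of-the-envelope arithmetic on the figures above, assuming throughput is simply tokens per cycle divided by time per cycle:

```python
def implied_cycle_cost(baseline_tok_s, spec_tok_s, tokens_per_pass):
    """Return (speedup, cost of one draft+verify cycle in units of a baseline pass)."""
    speedup = spec_tok_s / baseline_tok_s
    return speedup, tokens_per_pass / speedup


speedup, cycle_cost = implied_cycle_cost(12.2, 33.2, 3.5)
print(f"speedup: {speedup:.2f}x")
print(f"one draft+verify cycle costs about {cycle_cost:.2f} baseline passes")
```

Under that assumption a cycle costs roughly 1.3 baseline passes, so drafting and multi-token verification add some overhead, which is why ~3.5 tokens per pass translates to a 2.7x speedup and not 3.5x.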

This makes the 27B dense model faster than the 122B MoE on a single DGX Spark while delivering the quality advantages of a dense architecture. DFlash turns the DGX Spark from "it can run a 27B model" into "it runs a 27B model well."

Hybrid Architecture

Qwen3.5-27B uses a hybrid architecture mixing two attention types across 64 layers:

- Linear attention (GDN): Gated Delta Network layers for efficient long-context processing with O(1) per-token state (48 layers)
- Full attention: standard multi-head attention every 4th layer for global context capture (16 layers)

This gives near-linear scaling with sequence length while maintaining full-attention quality at key intervals.
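
The layer counts follow directly from the "every 4th layer" pattern. A small sketch, assuming the full-attention layer is the last one in each group of four (the exact position within the group is an assumption, and the string labels are illustrative):

```python
def layer_types(num_layers=64, full_attention_every=4):
    """Return the attention kind for each layer, with full attention on every 4th layer."""
    return [
        "full_attention" if (i + 1) % full_attention_every == 0 else "linear_attention"
        for i in range(num_layers)
    ]


layout = layer_types()
print(layout[:8])
print(layout.count("linear_attention"), "linear,", layout.count("full_attention"), "full")
```

With 64 layers this yields 48 linear-attention layers and 16 full-attention layers, matching the counts above.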

Vision + Text

Includes a 27-layer ViT vision encoder (460M params) with a merger that projects visual features into the language model's hidden space. Supports image understanding alongside text generation.

DFlash Block-Diffusion Speculative Decoding

Pair with z-lab/Qwen3.5-27B-DFlash — a 2B block-diffusion drafter that generates all speculative tokens simultaneously in a single diffusion step. The container auto-downloads and configures this.

Abliteration

Created using the orthogonal projection abliteration technique:

- Measures refusal directions across harmful/harmless prompt pairs
- Analyzes layer-by-layer activation patterns to identify the refusal direction
- Abliterates by projecting out the refusal direction from weight matrices

Modifies weights directly (not LoRA/adapter). Standalone BF16 model with no built-in refusal behavior.
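
The projection step can be sketched on a tiny matrix in pure Python. This is a generic illustration of orthogonal projection, not the llm-abliteration tool's code: estimating the direction as a difference of mean activations is a common choice that is assumed here, and all function names are hypothetical.

```python
import math


def mean_difference(harmful_acts, harmless_acts):
    """Assumed direction estimate: mean(harmful) - mean(harmless), per dimension."""
    dim = len(harmful_acts[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    return [mean(harmful_acts, j) - mean(harmless_acts, j) for j in range(dim)]


def project_out(weight, direction):
    """Return (I - r r^T) W for unit vector r, so outputs of W have no component along r."""
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    rows, cols = len(weight), len(weight[0])
    # r^T W: one coefficient per input column
    coeffs = [sum(r[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight[i][j] - r[i] * coeffs[j] for j in range(cols)] for i in range(rows)]


def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


direction = mean_difference([[2.0, 1.0, 0.0], [2.0, 3.0, 0.0]], [[1.0, 1.0, 0.0]])
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # maps 2 inputs to a 3-dim output
W_ablated = project_out(W, direction)
y = matvec(W_ablated, [0.5, -1.0])
print(sum(d * yi for d, yi in zip(direction, y)))  # approximately zero
```

After the projection, no input can make this layer write along the removed direction, which is why the change lives in the weights and needs no adapter at inference time.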

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3.5 (Hybrid, 27B parameters) |
| Layers | 64 (48 GDN + 16 full-attention) |
| Hidden Size | 5120 |
| Attention Heads | 24 (4 KV heads), head_dim=256 |
| Vision Encoder | 27-layer ViT, 460M params |
| Max Context | 131,072 tokens |
| Vocabulary | 248,320 tokens |
| Precision | BF16 |
| Model Size | ~52 GB |

Why NVFP4 on Blackwell

If you have an NVIDIA Blackwell GPU (B200, GB200, GB10/DGX Spark, or later), you should use the NVFP4 version instead. Here's why:

NVFP4 is effectively lossless on Blackwell. The FP4 (E2M1) format is a native tensor core datatype on Blackwell GPUs (compute capability 10.x on B200/GB200, 12.x on GB10). Unlike older INT4/GPTQ quantization, which can introduce significant degradation, NVFP4 with AWQ_FULL calibration preserves model quality while giving you:

- ~2.6x memory reduction: 20 GB vs 52 GB, freeing memory for longer context and more concurrent requests
- Hardware-accelerated FP4 GEMM: Blackwell tensor cores execute FP4 matrix multiplies natively via FlashInfer CUTLASS, not through dequantize-then-compute
- Higher throughput: the smaller weight footprint means less memory bandwidth consumed per token, which translates directly to faster inference
- Same quality: AWQ_FULL uses exhaustive grid search (10 scaling factors per layer) plus clipping optimization. The vision encoder, embeddings, norms, and lm_head remain in full BF16

This is a free performance boost: you get the same model quality at about 2.6x less memory and measurably faster inference. The BF16 version here is primarily for non-Blackwell hardware or research workflows that need full-precision weights.
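
For intuition about what a 4-bit E2M1 value can represent, the sketch below decodes all codes and rounds a small block of numbers onto that grid. It is a simplified illustration: the single max-based block scale kept as a Python float is an assumption for readability, and it does not reproduce NVFP4's actual scale encoding or the AWQ_FULL calibration.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        magnitude = 0.5 * mantissa  # subnormal: 0 or 0.5
    else:
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


def quantize_block(values):
    """Scale a block so its largest magnitude maps to 6.0, then round to the nearest grid value."""
    grid = sorted({decode_e2m1(c) for c in range(16)})
    scale = max(abs(v) for v in values) / 6.0 or 1.0  # fall back to 1.0 for an all-zero block
    return [scale * min(grid, key=lambda g: abs(v / scale - g)) for v in values]


print([decode_e2m1(c) for c in range(8)])  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(quantize_block([0.0, 3.0, -6.0, 1.4]))
```

Each sign has only eight magnitudes, so quality depends heavily on how well the per-block scales are chosen, which is the part that calibration addresses.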

Alternative Deployment

vLLM (Manual)

vllm serve AEON-7/DFlash-Qwen3.5-27B-Uncensored \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AEON-7/DFlash-Qwen3.5-27B-Uncensored"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Hello, tell me about yourself."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Credits

Base model by Qwen Team
DFlash speculative decoding by z-lab (paper)
Abliteration using llm-abliteration
Release by AEON-7

Legal Disclaimer

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. This model has had safety alignment removed. Users are responsible for ensuring ethical and legal use.

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC)

`bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4`

Ξ Ethereum (ETH)

`0x1512667F6D61454ad531d2E45C0a5d1fd82D0500`

◎ Solana (SOL)

`DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t`

ⓜ Monero (XMR)

`836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd`

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Paper: DFlash: Block Diffusion for Flash Speculative Decoding (arXiv 2602.06036, published Feb 5)