Model Overview

DeepSeek-V4-Flash-EAGLE3.1 is an EAGLE-3.1 speculative-decoding draft head for accelerating inference of DeepSeek-V4-Flash.

This is, to our knowledge, the first public EAGLE-3.1 draft head for DeepSeek V4-Flash. It is a research preview: held-out acceptance length is τ@3 ≈ 2.47 and measured wall-clock speedup is about 2.6× on patched vLLM, but serving still requires a vLLM overlay patch because upstream deepseek_v4 does not expose EAGLE-3 aux capture.

Training used TorchSpec offline EAGLE-3.1 (fc_norm + norm_output) with hidden states extracted through vLLM's extract_hidden_states path and a Maniac deepseek_v4 overlay.

Architecture

| Property | Value |
|---|---|
| Draft body | 1-layer Llama EAGLE-3.1 head (~400M params) |
| Target | `deepseek-ai/DeepSeek-V4-Flash` (284B total, 13B active MoE) |
| Aux taps (logical) | layers `[1, 21, 40]` (output-of-layer ids) |
| vLLM capture indices | `[2, 22, 41]` (+1 shift; see config) |
| mHC reduction | mean over 4 hyper-connection copies |
| Draft vocab | 32,000 (top-k from training corpus) |
| TTT depth (train) | 7 |

See config.json for full hyperparameters.
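
As a rough illustration of the two index rows and the mHC reduction in the table above, the sketch below shifts the logical tap ids by one and averages four hyper-connection copies of a hidden vector. The function names and the list-of-lists layout are hypothetical; the actual overlay operates on tensors and may be organised differently.

```python
# Illustrative only: names and data layout are assumptions, not the overlay's API.
LOGICAL_AUX_TAPS = [1, 21, 40]  # output-of-layer ids from the table


def to_vllm_capture_indices(logical_taps):
    """Apply the +1 shift that maps output-of-layer ids to vLLM capture indices."""
    return [layer_id + 1 for layer_id in logical_taps]


def reduce_mhc(copies):
    """Average the hyper-connection copies of one hidden vector element-wise."""
    n = len(copies)
    return [sum(values) / n for values in zip(*copies)]


capture_indices = to_vllm_capture_indices(LOGICAL_AUX_TAPS)
print(capture_indices)  # [2, 22, 41]

# Four toy copies of a 3-dimensional hidden state.
toy_copies = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
print(reduce_mhc(toy_copies))  # [2.0, 2.0, 2.0]
```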

Training

Framework: TorchSpec offline trainer + vLLM 0.21+ datagen (extract_hidden_states)
Cluster: Modal serverless — H200:8 for training, B200:4 for eval
Corpus (genv3-blend): 65k general (mlabonne/open-perfectblend) + 6k agentic (8.5% blend)
Schedule: 4 epochs, 4436 steps, global batch 64, lr 1e-4, max seq 8192
On-policy: greedy generation (temperature 0) — train the distribution you verify
W&B: maniac-labs/eagle3-v4flash run v4-flash-eagle3.1-genv3-blend3
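
The schedule and corpus figures above can be cross-checked with a few lines of arithmetic. This is only a consistency sketch using the rounded numbers listed here; the variable names are illustrative and the exact corpus sizes may differ slightly.

```python
# Figures taken from the Training list; corpus sizes are rounded in the source.
steps = 4436
global_batch = 64
epochs = 4
general_samples = 65_000
agentic_samples = 6_000

samples_seen = steps * global_batch        # total sequences processed
samples_per_epoch = samples_seen / epochs  # implied corpus size per epoch
agentic_share = agentic_samples / (general_samples + agentic_samples)

print(samples_seen)            # 283904
print(samples_per_epoch)       # 70976.0
print(f"{agentic_share:.1%}")  # 8.5%
```

The implied ~71k samples per epoch agrees with the 65k + 6k corpus, and the agentic share agrees with the stated 8.5% blend.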

The 6k agentic blend fixed a release-critical gap: agentic held-out E[A]@S7 went from 0.418 (general-only head) to 1.876 with no general regression (1.861 → 1.859).

Training code and patches: github.com/ManiacIncorporated/maniac-desktop/tree/main/training/eagle3-v4flash

Performance

Held-out acceptance (primary training metric)

Metric convention: τ (acceptance length) = 1 + E[A], where E[A] is the expected number of accepted draft tokens accumulated over depths 1 to S (TorchSpec sim_acc_len, depth capped at S). Kimi benchmarks at depth 3; we report both S=3 and S=7.

| Split | n | acc₀ | E[A]@S3 | E[A]@S7 | τ@3 |
|---|---|---|---|---|---|
| General (genv3-eval) | 512 | 0.713 | 1.473 | 1.859 | 2.47 |
| Agentic (genv3-evalreg) | 64 | 0.697 | 1.449 | 1.876 | 2.45 |

Per-depth general: [0.713, 0.658, 0.622, 0.610, 0.600, 0.593, 0.589]
Per-depth agentic: [0.697, 0.657, 0.642, 0.638, 0.633, 0.629, 0.626]
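
The reported E[A] values can be reproduced, to within rounding, by treating the per-depth numbers as conditional acceptance rates and summing their running products. That reading is inferred from the numbers themselves rather than stated explicitly, so take the sketch below as an interpretation; the function and variable names are illustrative.

```python
from itertools import accumulate
from operator import mul

# Per-depth acceptance rates copied from the two lines above.
PER_DEPTH = {
    "general": [0.713, 0.658, 0.622, 0.610, 0.600, 0.593, 0.589],
    "agentic": [0.697, 0.657, 0.642, 0.638, 0.633, 0.629, 0.626],
}


def expected_accepted(per_depth, depth):
    """Sum the running products of the first `depth` per-depth rates.

    Assumes each rate is conditional on all shallower drafts being accepted.
    """
    return sum(accumulate(per_depth[:depth], mul))


def acceptance_length(per_depth, depth):
    """tau = 1 + E[A], following the metric convention above."""
    return 1 + expected_accepted(per_depth, depth)


for split, rates in PER_DEPTH.items():
    e3 = expected_accepted(rates, 3)
    e7 = expected_accepted(rates, 7)
    tau3 = acceptance_length(rates, 3)
    print(f"{split}: E[A]@S3={e3:.3f} E[A]@S7={e7:.3f} tau@3={tau3:.2f}")
```

The outputs agree with the table to about 0.001, which is the precision of the published per-depth rates.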

Reference (Kimi K2.5/K2.6 EAGLE-3.1 @ depth 3): τ ≈ 2.69 (dialogue) – 3.8 (function-call). At τ@3 ≈ 2.47, this head sits slightly below the dialogue low end: not Kimi SOTA, but strong for a smaller target.

Wall-clock speedup (vLLM 0.22, B200:4, patched overlay)

| Metric | Baseline | EAGLE-3.1 | Speedup |
|---|---|---|---|
| Throughput | 15.6 tok/s | 41.1 tok/s | 2.63× |
| Mean accept len | — | 1.33 | |
| Draft acceptance rate* | — | 11.0% | |

*vLLM counter: accepted draft tokens / total draft tokens (not identical to offline acc₀).
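
The figures in this table are mutually consistent under a simple reading: speedup is the throughput ratio, and mean accept length is one verified token plus the accepted share of the drafted tokens. The sketch below checks that arithmetic. It assumes the benchmark used 3 speculative tokens, as in the Quick Start config; the source does not state this for the benchmark run.

```python
# Values from the wall-clock table.
baseline_tok_s = 15.6
eagle_tok_s = 41.1
draft_acceptance_rate = 0.110

# Assumption: same value as num_speculative_tokens in the Quick Start config.
num_speculative_tokens = 3

speedup = eagle_tok_s / baseline_tok_s
# One token always comes from the verify step; the rest are accepted drafts.
mean_accept_len = 1 + num_speculative_tokens * draft_acceptance_rate

print(f"{speedup:.2f}x")         # 2.63x
print(f"{mean_accept_len:.2f}")  # 1.33
```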

Eval: 8 prompts × 128 greedy tokens, raw strings (no chat template). Full JSON in benchmark_results.json.

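Greedy correctness: the token match between speculative and baseline output (44.6%) exceeds the match between two baseline runs (35.9%), so the mismatches fall within the cross-run FP8+EP noise floor rather than being introduced by speculation. EAGLE verify is lossless in principle.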
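
One plausible way to compute a token match figure like the ones above is a position-wise comparison of two generated token sequences. How the benchmark actually measured it is not specified here, so the function below is a hypothetical sketch with toy data, not the evaluation script.

```python
def token_match_rate(tokens_a, tokens_b):
    """Fraction of aligned positions where two token sequences agree.

    Compares only the overlapping prefix length; returns 0.0 for empty input.
    """
    length = min(len(tokens_a), len(tokens_b))
    if length == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(tokens_a, tokens_b))
    return matches / length


# Toy token ids standing in for two greedy runs of the same prompt.
baseline_run = [5, 9, 2, 7, 7, 1, 4, 8]
spec_run = [5, 9, 2, 6, 7, 3, 0, 8]
print(token_match_rate(baseline_run, spec_run))  # 0.625
```

Under greedy decoding, a single numerical difference can change every later token, which is why even two baseline runs can score well below 100% on a metric of this kind.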

Quick Start

Requires vLLM overlay. See SERVING.md for install steps.

Python

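import os

# Must be set before vLLM is imported.
os.environ.setdefault("EAGLE3_DRAFT_KV_CACHE_DTYPE", "auto")

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # target model
    trust_remote_code=True,
    tensor_parallel_size=4,
    enable_expert_parallel=True,
    enforce_eager=True,
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.6,
    speculative_config={
        "method": "eagle3",
        "model": "ManiacLabs/DeepSeek-V4-Flash-EAGLE3.1",  # draft head
        "num_speculative_tokens": 3,
    },
)

# Greedy decoding, matching the temperature-0 setup used for training and eval.
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)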

Set EAGLE3_DRAFT_KV_CACHE_DTYPE=auto and install the overlay before importing vLLM.

Limitations

Not plug-and-play: stock vLLM cannot serve V4 + EAGLE-3 without the Maniac overlay.
No MLX port yet: local Mac inference path is documented but not shipped.
No SGLang / llama.cpp support in this release.
acc₀ ~0.71 vs 0.85+ on larger Kimi targets — expect **2.5–2.7×**, not ~3.5×, unless you retrain with more data / feature ablations.
Training pool: ~13% duplicate general prompts (open-perfectblend trait); disclosed for reproducibility.
License: MIT on draft weights; base model terms apply for DeepSeek-V4-Flash.

Citation

If you use this draft head, please cite the base model and acknowledge the training stack (TorchSpec + vLLM EAGLE-3). Training logs: W&B project maniac-labs/eagle3-v4flash.
