Model Overview
DeepSeek-V4-Flash-EAGLE3.1 is an EAGLE-3.1 speculative-decoding draft head for accelerating inference of DeepSeek-V4-Flash.
This is, to our knowledge, the first public EAGLE-3.1 draft head for DeepSeek-V4-Flash.
It is a research preview: training metrics are solid and wall-clock speedup is ~2.6× on
patched vLLM, but serving still requires a vLLM overlay patch because upstream deepseek_v4
does not expose EAGLE-3 aux capture.
Training used TorchSpec offline EAGLE-3.1 (fc_norm + norm_output) with hidden states
extracted through vLLM's extract_hidden_states path and a Maniac deepseek_v4 overlay.
Architecture
| Property | Value |
|---|---|
| Draft body | 1-layer Llama EAGLE-3.1 head (~400M params) |
| Target | deepseek-ai/DeepSeek-V4-Flash (284B total, 13B active MoE) |
| Aux taps (logical) | layers [1, 21, 40] (output-of-layer ids) |
| vLLM capture indices | [2, 22, 41] (+1 shift; see config) |
| mHC reduction | mean over 4 hyper-connection copies |
| Draft vocab | 32,000 (top-k from training corpus) |
| TTT depth (train) | 7 |
See config.json for full hyperparameters.
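The two index rows and the mHC reduction in the table can be illustrated with a small sketch. The helper names and the toy vectors below are hypothetical and are not taken from the overlay; the sketch only assumes that the capture indices are the logical layer ids shifted by one and that the reduction is an elementwise mean over the four hyper-connection copies.

```python
# Logical aux tap layers as listed in the Architecture table.
LOGICAL_AUX_LAYERS = [1, 21, 40]


def to_capture_indices(logical_layers, shift=1):
    """Apply the +1 shift that maps logical layer ids to vLLM capture indices."""
    return [layer + shift for layer in logical_layers]


def mean_over_copies(copies):
    """Elementwise mean over equal-length hidden-state copies (hypothetical helper)."""
    n = len(copies)
    return [sum(values) / n for values in zip(*copies)]


capture_indices = to_capture_indices(LOGICAL_AUX_LAYERS)

# Four toy "hyper-connection copies" of a 2-dimensional hidden state.
toy_copies = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reduced = mean_over_copies(toy_copies)

print(capture_indices)
print(reduced)
```

The real hidden states are tensors rather than Python lists; config.json remains the authoritative source for the indices.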
Training
- Framework: TorchSpec offline trainer + vLLM 0.21+ datagen (extract_hidden_states)
- Cluster: Modal serverless (H200:8 for training, B200:4 for eval)
- Corpus (genv3-blend): 65k general (mlabonne/open-perfectblend) + 6k agentic (8.5% blend)
- Schedule: 4 epochs, 4436 steps, global batch 64, lr 1e-4, max seq 8192
- On-policy: greedy generation (temperature 0); train the distribution you verify
- W&B: maniac-labs/eagle3-v4flash run v4-flash-eagle3.1-genv3-blend3
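As a rough consistency check on the corpus and schedule figures, the blend fraction and step count can be recomputed from the rounded sizes. This sketch assumes one training sample per prompt and no sequence packing, neither of which the card states, so the step count is only expected to land near the reported 4436.

```python
# Rounded corpus sizes from the Training list ("65k" and "6k" are approximate).
GENERAL_PROMPTS = 65_000
AGENTIC_PROMPTS = 6_000
EPOCHS = 4
GLOBAL_BATCH = 64

total_prompts = GENERAL_PROMPTS + AGENTIC_PROMPTS

# Share of agentic data in the blended corpus.
agentic_fraction = AGENTIC_PROMPTS / total_prompts

# Optimizer steps if every prompt is exactly one sample (an assumption).
approx_steps = EPOCHS * total_prompts / GLOBAL_BATCH

print(f"agentic fraction: {agentic_fraction:.1%}")
print(f"approximate steps: {approx_steps:.0f}")
```

Both values come out close to the reported 8.5% blend and 4436 steps; the small gap in steps is plausibly due to the rounded corpus sizes.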
The 6k agentic blend fixed a release-critical gap: agentic held-out E[A]@S7 went from 0.418 (general-only head) to 1.876 with no general regression (1.861 → 1.859).
Training code and patches: github.com/ManiacIncorporated/maniac-desktop/tree/main/training/eagle3-v4flash
Performance
Held-out acceptance (primary training metric)
Metric convention: τ (acceptance length) = 1 + E[A], where E[A] is cumulative
per-depth acceptance (TorchSpec sim_acc_len, depth capped at S). Kimi benchmarks at
depth 3; we report both S=3 and S=7.
| Split | n | acc₀ | E[A]@S3 | E[A]@S7 | τ@3 |
|---|---|---|---|---|---|
| General (genv3-eval) | 512 | 0.713 | 1.473 | 1.859 | 2.47 |
| Agentic (genv3-evalreg) | 64 | 0.697 | 1.449 | 1.876 | 2.45 |
Per-depth general: [0.713, 0.658, 0.622, 0.610, 0.600, 0.593, 0.589]
Per-depth agentic: [0.697, 0.657, 0.642, 0.638, 0.633, 0.629, 0.626]
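The table values can be reproduced from the per-depth lists above if each per-depth figure is read as a conditional acceptance rate (the chance that depth d is accepted given that all shallower drafts were accepted). That reading is an inference from the numbers, not something the card or TorchSpec is quoted as stating; the function name below is hypothetical.

```python
from itertools import accumulate
from operator import mul

# Per-depth acceptance figures copied from the lists above.
GENERAL = [0.713, 0.658, 0.622, 0.610, 0.600, 0.593, 0.589]
AGENTIC = [0.697, 0.657, 0.642, 0.638, 0.633, 0.629, 0.626]


def expected_accepted(per_depth, depth):
    """E[A] capped at `depth`: sum of running products of the per-depth rates.

    Each running product is the probability that drafts 1..d are all accepted,
    under the assumption that the rates are conditional on earlier acceptance.
    """
    return sum(accumulate(per_depth[:depth], mul))


for name, rates in (("general", GENERAL), ("agentic", AGENTIC)):
    e3 = expected_accepted(rates, 3)
    e7 = expected_accepted(rates, 7)
    # Acceptance length is 1 + E[A], per the metric convention.
    print(f"{name}: E[A]@S3={e3:.3f}  E[A]@S7={e7:.3f}  accept_len@3={1 + e3:.2f}")
```

Under this assumption the results agree with the table to within rounding of the three-decimal inputs.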
Reference (Kimi K2.5/K2.6 EAGLE-3.1 @ depth 3): τ ≈ 2.69 (dialogue) to 3.8 (function-call). This head sits at the dialogue low end: not Kimi SOTA, but strong for a smaller target.
Wall-clock speedup (vLLM 0.22, B200:4, patched overlay)
| Metric | Baseline | EAGLE-3.1 | Speedup |
|---|---|---|---|
| Throughput | 15.6 tok/s | 41.1 tok/s | 2.63× |
| Mean accept len | n/a | 1.33 | |
| Draft acceptance rate* | n/a | 11.0% | |
*vLLM counter: accepted draft tokens / total draft tokens (not identical to offline acc₀).
Eval: 8 prompts × 128 greedy tokens, raw strings (no chat template). Full JSON in
benchmark_results.json.
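The wall-clock figures are internally consistent, which the following sketch checks. It assumes the benchmark proposed exactly 3 draft tokens per step (the num_speculative_tokens value shown in Quick Start; the card does not say which value the benchmark used) and that mean accept length counts the one verified token plus the accepted drafts.

```python
def speedup(baseline_tps, spec_tps):
    """Throughput ratio of speculative decoding over the baseline."""
    return spec_tps / baseline_tps


def mean_accept_len(draft_acceptance_rate, num_speculative_tokens):
    """One verified token per step plus the expected accepted drafts.

    Assumes every step proposes exactly `num_speculative_tokens` drafts.
    """
    return 1.0 + draft_acceptance_rate * num_speculative_tokens


throughput_ratio = speedup(15.6, 41.1)
accept_len = mean_accept_len(0.110, 3)

print(f"speedup: {throughput_ratio:.2f}x")
print(f"mean accept len: {accept_len:.2f}")
```

With these assumptions the ratio matches the reported 2.63× and the derived accept length matches the reported 1.33.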
Greedy correctness: the token match rate between speculative and baseline outputs (44.6%) exceeds the match rate between two baseline runs (35.9%, the cross-run FP8+EP noise floor), which is consistent with EAGLE verification being lossless in principle.
Quick Start
Requires vLLM overlay. See SERVING.md for install steps.
Python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",
    trust_remote_code=True,
    tensor_parallel_size=4,
    enable_expert_parallel=True,
    enforce_eager=True,
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.6,
    speculative_config={
        "method": "eagle3",
        "model": "ManiacLabs/DeepSeek-V4-Flash-EAGLE3.1",
        "num_speculative_tokens": 3,
    },
)
Set EAGLE3_DRAFT_KV_CACHE_DTYPE=auto and install the overlay before importing vLLM.
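One way to honor the ordering requirement from inside Python is to set the variable at the very top of the script and defer the vLLM import. This is a minimal sketch: the helper name generate_greedy is hypothetical, the greedy settings mirror the benchmark description rather than a documented recipe, and installing the overlay itself is covered by SERVING.md, not by this snippet.

```python
import os

# Must run before vLLM is imported anywhere in the process.
os.environ["EAGLE3_DRAFT_KV_CACHE_DTYPE"] = "auto"


def generate_greedy(llm, prompts, max_tokens=128):
    """Run greedy decoding on an already constructed vllm.LLM instance."""
    # Deferred import so the environment variable above is already in place.
    from vllm import SamplingParams

    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    return [request.outputs[0].text for request in llm.generate(prompts, params)]
```

With the llm object built as in the block above, a call such as generate_greedy(llm, ["Once upon a time,"]) returns one completion string per prompt.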
Limitations
- Not plug-and-play: stock vLLM cannot serve V4 + EAGLE-3 without the Maniac overlay.
- No MLX port yet: local Mac inference path is documented but not shipped.
- No SGLang / llama.cpp support in this release.
- acc₀ ~0.71 vs 0.85+ on larger Kimi targets: expect **2.5 to 2.7×**, not ~3.5×, unless you retrain with more data / feature ablations.
- Training pool: ~13% duplicate general prompts (open-perfectblend trait); disclosed for reproducibility.
- License: MIT on draft weights; base model terms apply for DeepSeek-V4-Flash.
Citation
If you use this draft head, please cite the base model and acknowledge the training stack
(TorchSpec + vLLM EAGLE-3). Training logs: W&B project maniac-labs/eagle3-v4flash.