Model Overview

DeepSeek-V4-Flash-EAGLE3.1 is an EAGLE-3.1 speculative-decoding draft head for accelerating inference of DeepSeek-V4-Flash.

This is, to our knowledge, the first public EAGLE-3.1 draft head for DeepSeek-V4-Flash. It is a research preview: training metrics are solid and wall-clock speedup is ~2.6× on patched vLLM, but serving still requires a vLLM overlay patch (upstream deepseek_v4 does not expose EAGLE-3 aux capture).

Training used TorchSpec offline EAGLE-3.1 (fc_norm + norm_output) with hidden states extracted through vLLM's extract_hidden_states path and a Maniac deepseek_v4 overlay.

Architecture

| Property | Value |
|---|---|
| Draft body | 1-layer Llama EAGLE-3.1 head (~400M params) |
| Target | deepseek-ai/DeepSeek-V4-Flash (284B total, 13B active MoE) |
| Aux taps (logical) | layers [1, 21, 40] (output-of-layer ids) |
| vLLM capture indices | [2, 22, 41] (+1 shift; see config) |
| mHC reduction | mean over 4 hyper-connection copies |
| Draft vocab | 32,000 (top-k from training corpus) |
| TTT depth (train) | 7 |

See config.json for full hyperparameters.
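The tap-to-capture mapping and the mHC reduction listed above can be sketched in a few lines. This is an illustration only: the function names and the list-of-vectors layout are assumptions made for the example, not the overlay's actual API or tensor shapes.

```python
# Illustrative sketch: names and data layout are assumptions, not the overlay's API.
LOGICAL_AUX_TAPS = [1, 21, 40]  # output-of-layer ids from the architecture table
NUM_HC_COPIES = 4               # hyper-connection copies averaged by the mHC reduction


def to_capture_indices(taps, shift=1):
    """Apply the +1 shift that turns logical layer ids into vLLM capture indices."""
    return [tap + shift for tap in taps]


def reduce_hyper_connections(copies):
    """Average the per-copy hidden vectors element-wise (mean over copies)."""
    if len(copies) != NUM_HC_COPIES:
        raise ValueError(f"expected {NUM_HC_COPIES} copies, got {len(copies)}")
    width = len(copies[0])
    return [sum(copy[i] for copy in copies) / len(copies) for i in range(width)]


capture_indices = to_capture_indices(LOGICAL_AUX_TAPS)

# Toy hidden state: 4 copies of a 2-dimensional vector.
toy_copies = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reduced = reduce_hyper_connections(toy_copies)

print(capture_indices)
print(reduced)
```

The authoritative values remain those in config.json; the sketch only shows how the two rows of the table relate.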


Training

  • Framework: TorchSpec offline trainer + vLLM 0.21+ datagen (extract_hidden_states)
  • Cluster: Modal serverless, with H200:8 for training and B200:4 for eval
  • Corpus (genv3-blend): 65k general (mlabonne/open-perfectblend) + 6k agentic (8.5% blend)
  • Schedule: 4 epochs, 4436 steps, global batch 64, lr 1e-4, max seq 8192
  • On-policy: greedy generation (temperature 0); train the distribution you verify
  • W&B: maniac-labs/eagle3-v4flash run v4-flash-eagle3.1-genv3-blend3
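The corpus and schedule figures above are mutually consistent, which a short calculation makes visible. The sketch treats the "65k" and "6k" sizes as exact round numbers, which is an assumption; the true counts are not stated here.

```python
# Sanity check of the training figures; corpus sizes are rounded (assumption).
GENERAL_SAMPLES = 65_000   # "65k general"
AGENTIC_SAMPLES = 6_000    # "6k agentic"
EPOCHS = 4
TOTAL_STEPS = 4436
GLOBAL_BATCH = 64

total_samples = GENERAL_SAMPLES + AGENTIC_SAMPLES
agentic_fraction = AGENTIC_SAMPLES / total_samples

steps_per_epoch = TOTAL_STEPS // EPOCHS
samples_per_epoch = steps_per_epoch * GLOBAL_BATCH

print(f"agentic share of blend: {agentic_fraction:.1%}")
print(f"steps per epoch:        {steps_per_epoch}")
print(f"samples per epoch:      {samples_per_epoch} (corpus ~{total_samples})")
```

One epoch of 1109 steps at batch 64 covers 70,976 samples, which lines up with a blend of roughly 71k examples.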

The 6k agentic blend fixed a release-critical gap: agentic held-out E[A]@S7 went from 0.418 (general-only head) to 1.876 with no general regression (1.861 → 1.859).

Training code and patches: github.com/ManiacIncorporated/maniac-desktop/tree/main/training/eagle3-v4flash


Performance

Held-out acceptance (primary training metric)

Metric convention: τ (acceptance length) = 1 + E[A], where E[A] is cumulative per-depth acceptance (TorchSpec sim_acc_len, depth capped at S). Kimi benchmarks at depth 3; we report both S=3 and S=7.

| Split | n | acc₀ | E[A]@S3 | E[A]@S7 | τ@3 |
|---|---|---|---|---|---|
| General (genv3-eval) | 512 | 0.713 | 1.473 | 1.859 | 2.47 |
| Agentic (genv3-evalreg) | 64 | 0.697 | 1.449 | 1.876 | 2.45 |

Per-depth general: [0.713, 0.658, 0.622, 0.610, 0.600, 0.593, 0.589]
Per-depth agentic: [0.697, 0.657, 0.642, 0.638, 0.633, 0.629, 0.626]
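The per-depth lists above reproduce the tabulated E[A] values if each entry is read as a conditional acceptance rate (the probability that depth k is accepted given that all earlier depths were). Under that reading, which is an assumption about how sim_acc_len accumulates, E[A] is the sum of running products. The function name below is hypothetical.

```python
# Sketch: E[A]@S as a sum of cumulative products of per-depth acceptance rates.
# Assumes each per-depth value is conditional on all earlier depths being accepted.

def expected_accepted(per_depth, max_depth):
    """Expected number of accepted draft tokens with depth capped at max_depth."""
    total = 0.0
    survive = 1.0  # probability that every depth so far was accepted
    for rate in per_depth[:max_depth]:
        survive *= rate
        total += survive
    return total


general = [0.713, 0.658, 0.622, 0.610, 0.600, 0.593, 0.589]
agentic = [0.697, 0.657, 0.642, 0.638, 0.633, 0.629, 0.626]

for name, rates in (("general", general), ("agentic", agentic)):
    ea3 = expected_accepted(rates, 3)
    ea7 = expected_accepted(rates, 7)
    print(f"{name}: E[A]@S3={ea3:.3f}  E[A]@S7={ea7:.3f}  tau@3={1 + ea3:.2f}")
```

The results agree with the table to within about 0.001, the residual coming from the three-decimal rounding of the per-depth inputs.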

Reference (Kimi K2.5/K2.6 EAGLE-3.1 @ depth 3): τ ≈ 2.69 (dialogue) to 3.8 (function-call). At τ@3 ≈ 2.45 to 2.47, this head sits just below the dialogue low end: not Kimi SOTA, but strong for a smaller target.

Wall-clock speedup (vLLM 0.22, B200:4, patched overlay)

| Metric | Baseline | EAGLE-3.1 | Speedup |
|---|---|---|---|
| Throughput | 15.6 tok/s | 41.1 tok/s | 2.63× |
| Mean accept len | n/a | 1.33 | |
| Draft acceptance rate* | n/a | 11.0% | |

*vLLM counter: accepted draft tokens / total draft tokens (not identical to offline acc₀).
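The wall-clock figures can be cross-checked against each other. The sketch below rests on two assumptions that are not stated in this card: that every draft step proposes exactly 3 tokens (the num_speculative_tokens value used in Quick Start), and that the reported mean accept length counts the one verified target token plus the accepted draft tokens per step.

```python
# Cross-check of the wall-clock table under the assumptions stated in the lead-in.
NUM_SPECULATIVE_TOKENS = 3      # assumed: same setting as the Quick Start config
baseline_tok_s = 15.6
eagle_tok_s = 41.1
draft_acceptance_rate = 0.110   # accepted draft tokens / total draft tokens

speedup = eagle_tok_s / baseline_tok_s

# Accepted draft tokens per step = rate * tokens drafted per step;
# add 1 for the token the target model always contributes.
mean_accept_len = 1.0 + NUM_SPECULATIVE_TOKENS * draft_acceptance_rate

print(f"speedup:         {speedup:.2f}x")
print(f"mean accept len: {mean_accept_len:.2f}")
```

Under those assumptions the 11.0% acceptance rate and the 1.33 mean accept length are two views of the same counter, and the throughput ratio matches the reported 2.63×.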

Eval: 8 prompts × 128 greedy tokens, raw strings (no chat template). Full JSON in benchmark_results.json.

Greedy correctness: the spec-vs-baseline token match (44.6%) exceeds the baseline-vs-baseline match across runs (35.9%), so the mismatch sits within the same cross-run FP8+EP noise floor. EAGLE verification is lossless in principle.


Quick Start

Requires vLLM overlay. See SERVING.md for install steps.

Python

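import os

# Must be set before vLLM is imported.
os.environ.setdefault("EAGLE3_DRAFT_KV_CACHE_DTYPE", "auto")

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",   # target model
    trust_remote_code=True,
    tensor_parallel_size=4,                  # 4 GPUs, as in the B200:4 eval
    enable_expert_parallel=True,
    enforce_eager=True,
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.6,
    speculative_config={
        "method": "eagle3",
        "model": "ManiacLabs/DeepSeek-V4-Flash-EAGLE3.1",  # this draft head
        "num_speculative_tokens": 3,
    },
)

# Greedy decoding on a raw-string prompt (no chat template), 128 new tokens.
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)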

Set EAGLE3_DRAFT_KV_CACHE_DTYPE=auto and install the overlay before importing vLLM.


Limitations

  • Not plug-and-play: stock vLLM cannot serve V4 + EAGLE-3 without the Maniac overlay.
  • No MLX port yet: local Mac inference path is documented but not shipped.
  • No SGLang / llama.cpp support in this release.
  • acc₀ ~0.71 vs 0.85+ on larger Kimi targets: expect **2.5–2.7×**, not ~3.5×, unless you retrain with more data / feature ablations.
  • Training pool: ~13% duplicate general prompts (open-perfectblend trait); disclosed for reproducibility.
  • License: MIT on draft weights; base model terms apply for DeepSeek-V4-Flash.

Citation

If you use this draft head, please cite the base model and acknowledge the training stack (TorchSpec + vLLM EAGLE-3). Training logs: W&B project maniac-labs/eagle3-v4flash.


Links
