
DeepSeek-V4-Flash-180B-GGUF

GGUF quantization of 0xSero/DeepSeek-V4-Flash-180B.

At a glance

| | |
| --- | --- |
| Base model | 0xSero/DeepSeek-V4-Flash-180B |
| Format | GGUF |
| Total params | 180B |
| Active / token | |
| Experts / layer | |
| Layers | |
| Hidden size | |
| Context | |
| On-disk size | 164 GB |

Which variant should I pick?

| Variant | Format | Link |
| --- | --- | --- |
| DeepSeek-V4-Flash-162B | BF16 | link |
| DeepSeek-V4-Flash-162B-GGUF | GGUF | link |
| DeepSeek-V4-Flash-180B | BF16 | link |
| DeepSeek-V4-Flash-180B-GGUF (this) | GGUF | link |
| DeepSeek-V4-Flash-213B | BF16 | link |

This repository contains DS4/DwarfStar GGUF conversions of DeepSeek-V4-Flash-Spark.

The GGUFs point back to the original Spark Hugging Face model:

Files

| File | Size | SHA256 |
| --- | --- | --- |
| DeepSeek-V4-Flash-Spark-Q2-REAP-ds4.gguf | 53.52 GiB | dae2ed196e8ad87d6667d3fa04f65d78302ea4f148ed0ee0f3ff0b829d1f9c5d |
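
A download of this size is worth checking against the published checksum before loading it. The following is a minimal sketch using only the Python standard library; the function names `sha256_of_file` and `matches_expected` are illustrative and not part of any DS4 tooling, and the local path you pass in is whatever you saved the file as.

```python
import hashlib
from pathlib import Path

# Checksum copied from the Files table for DeepSeek-V4-Flash-Spark-Q2-REAP-ds4.gguf.
EXPECTED_SHA256 = "dae2ed196e8ad87d6667d3fa04f65d78302ea4f148ed0ee0f3ff0b829d1f9c5d"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 in 1 MiB chunks so it never has to fit in RAM."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected("DeepSeek-V4-Flash-Spark-Q2-REAP-ds4.gguf")` on the downloaded file should return `True`; a `False` result usually points to a truncated or corrupted download.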

Quantization

  • Q2-REAP-ds4: compact DS4 profile using IQ2_XXS routed gate/up experts, Q2_K routed down experts, and Q8_0 shared/output/attention projections.

These are DS4/DwarfStar-specific GGUF files for DeepSeek-V4 Flash REAP checkpoints. Do not treat them as generic llama.cpp files: they load only in runtimes that support the same DeepSeek-V4 Flash tensor layout and DS4 metadata.
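
Before pointing a runtime at the file, it can help to confirm that it is at least a well-formed GGUF container. The sketch below reads the fixed-size GGUF header (magic, version, tensor count, metadata key/value count), assuming a little-endian file with GGUF version 2 or later, where both counts are 64-bit. It does not check the DS4 metadata or tensor layout, so a passing result says nothing about whether a given runtime can actually load the model; `read_gguf_header` is an illustrative name.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file header.

    Assumes little-endian encoding and GGUF version >= 2 (64-bit counts).
    """
    with open(path, "rb") as handle:
        header = handle.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with a GGUF header")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:HEADER_SIZE])
    return version, tensor_count, kv_count
```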

Validation

Validation summaries are uploaded in this repo under:

  • validation/20260528T160633Z/SUMMARY.md
  • validation/20260528T160633Z/summary.json

The Spark Q2 GGUF completed the DS4 context sweep up to a 200,000-token context on a single DGX Spark:

| Context (tokens) | Prefill tok/s | Decode tok/s | KV bytes |
| --- | --- | --- | --- |
| 2,048 | 360.26 | 13.63 | 52,184,460 |
| 4,096 | 357.05 | 13.74 | 80,373,132 |
| 8,192 | 360.20 | 13.56 | 136,750,476 |
| 16,384 | 348.30 | 13.31 | 249,505,164 |
| 32,768 | 333.74 | 12.59 | 475,014,540 |
| 65,536 | 306.45 | 11.79 | 926,033,292 |
| 131,072 | 267.63 | 10.29 | 1,828,070,796 |
| 200,000 | 214.11 | 9.25 | 2,776,775,308 |
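
The KV column can be used to estimate memory for context sizes that were not measured. The sketch below takes the (context, KV bytes) pairs from the sweep and computes the incremental bytes per added token between consecutive rows, plus a fixed overhead extrapolated from the first two rows. The linear-growth model is an assumption made here for illustration, not something the validation summary states, and the helper names are hypothetical.

```python
# (context tokens, KV bytes) pairs copied from the context sweep.
SWEEP = [
    (2_048, 52_184_460),
    (4_096, 80_373_132),
    (8_192, 136_750_476),
    (16_384, 249_505_164),
    (32_768, 475_014_540),
    (65_536, 926_033_292),
    (131_072, 1_828_070_796),
    (200_000, 2_776_775_308),
]


def kv_bytes_per_token(rows):
    """Incremental KV bytes per added context token between consecutive rows."""
    return [(b1 - b0) / (c1 - c0) for (c0, b0), (c1, b1) in zip(rows, rows[1:])]


def fixed_overhead(rows):
    """Extrapolate the first segment back to zero context (assumes linear growth)."""
    (c0, b0), (c1, b1) = rows[0], rows[1]
    slope = (b1 - b0) / (c1 - c0)
    return b0 - slope * c0


def estimate_kv_bytes(context_tokens, rows=SWEEP):
    """Linear estimate of KV bytes for an unmeasured context size."""
    slope = kv_bytes_per_token(rows)[0]
    return fixed_overhead(rows) + slope * context_tokens


slopes = kv_bytes_per_token(SWEEP)
overhead = fixed_overhead(SWEEP)
```

On these rows the slope works out to roughly 13,764 bytes per token (about 13.4 KiB) with a fixed overhead of about 24 MB. The final step from 131,072 to 200,000 tokens comes out marginally lower, so treat estimates near the top of the range as approximate.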

The corrected 200K API probe used 182,633 prompt tokens and returned the visible marker SPARK-CTX-200000-OMEGA:

| Prompt tokens | TTFT (s) | Prefill tok/s | Decode tok/s | Passed core |
| --- | --- | --- | --- | --- |
| 182,633 | 668.53 | 273.19 | 10.61 | true |
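
The probe's prefill rate is consistent with dividing prompt tokens by time to first token (TTFT). The sketch below reproduces that arithmetic, on the assumption that TTFT is dominated by prefill for a prompt this long; the function name is illustrative.

```python
def prefill_tokens_per_second(prompt_tokens, ttft_seconds):
    """Approximate prefill throughput as prompt tokens divided by time to first token."""
    if ttft_seconds <= 0:
        raise ValueError("ttft_seconds must be positive")
    return prompt_tokens / ttft_seconds


# Values from the 200K API probe.
rate = prefill_tokens_per_second(182_633, 668.53)
ttft_minutes = 668.53 / 60
```

In practice this means a prompt of this length waits about 11 minutes before the first output token arrives on this hardware.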

Terminal-Bench 2.0 evidence is included in the validation summary: one real gpt2-codegolf trial completed without harness errors after enabling amd64 binfmt on the ARM64 Spark host.

This repo publishes the validated Q2 long-context profile only.

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
