---
license: mit
base_model: deepseek-ai/DeepSeek-V4-Flash
library_name: gguf
pipeline_tag: text-generation
tags:
  - gguf
  - deepseek-v4-flash
  - moe
  - quantization
  - iq1_m
  - dynamic-quantization
---

# DeepSeek-V4-Flash Dynamic IQ1_M GGUF

This repository contains a GGUF-quantized checkpoint of **DeepSeek-V4-Flash**, produced with a dynamic routed-MoE IQ1_M recipe.

## Files

- `dsv4-dynamic-iq1m-antirez.gguf` — full dynamic IQ1_M GGUF checkpoint.
- `metadata/dsv4_dynamic_iq1m_complete.tensor_types.txt` — exact tensor-type recipe used for quantization.
- `checksums.txt` — SHA256 checksums for the uploaded artifact and metadata.
- `logs/quantize_antirez_q8base.log` — final quantization log.
- `logs/quantize_antirez_dryrun_q8base.log` — dry-run quantization log.

The imatrix/calibration file is intentionally **not** included in this upload.

## Quantization recipe

The routed expert recipe is:

- `ffn_gate_exps`: all 43 layers -> `iq1_m`
- `ffn_up_exps`: all 43 layers -> `iq1_m`
- `ffn_down_exps`: layers 0-5 -> `q2_k`
- `ffn_down_exps`: layers 6-42 -> `iq1_m`
- Non-routed tensors keep the types listed in the complete tensor-type recipe (`metadata/dsv4_dynamic_iq1m_complete.tensor_types.txt`).

The final validated dtype distribution was:

```text
f16:   359 tensors
f32:   492 tensors
i32:     3 tensors
iq1_m: 123 tensors
q2_k:    6 tensors
q8_0:  345 tensors
```
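
As a quick consistency check, the `iq1_m` and `q2_k` counts above can be derived from the routed expert recipe. The following is a minimal sketch; the variable names are illustrative, and it assumes one tensor per layer for each of `ffn_gate_exps`, `ffn_up_exps`, and `ffn_down_exps`:

```shell
#!/usr/bin/env bash
# Sketch: derive the expected routed-expert tensor counts from the recipe.
# Assumption: each of ffn_gate_exps / ffn_up_exps / ffn_down_exps contributes
# exactly one tensor per layer.
n_layers=43          # layers 0..42
q2k_down_layers=6    # ffn_down_exps layers 0..5 -> q2_k

iq1m_gate=$n_layers                            # ffn_gate_exps, all layers
iq1m_up=$n_layers                              # ffn_up_exps, all layers
iq1m_down=$(( n_layers - q2k_down_layers ))    # ffn_down_exps, layers 6..42

iq1m_total=$(( iq1m_gate + iq1m_up + iq1m_down ))
q2k_total=$q2k_down_layers

echo "iq1_m: $iq1m_total tensors"   # prints "iq1_m: 123 tensors"
echo "q2_k:  $q2k_total tensors"    # prints "q2_k:  6 tensors"
```

Both values match the validated distribution: 43 + 43 + 37 = 123 `iq1_m` tensors and 6 `q2_k` tensors.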

The checkpoint was quantized from a full routed-F16 GGUF source with llama.cpp's `llama-quantize`, using an antirez imatrix. The command passes `Q8_0` as the positional base type only to activate the complete per-tensor overrides; the trailing `32` is the thread count:

```bash
llama-quantize \
  --imatrix .tmp/DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-1p5m.dat \
  --tensor-type-file dsv4_dynamic_iq1m_complete.tensor_types.txt \
  .tmp/dsv4-source-f16-full-routed.gguf \
  .tmp/dsv4-dynamic-iq1m-antirez.gguf \
  Q8_0 32
```

## Runtime status

This checkpoint is intended for ongoing development of device-side IQ1_M routed-MoE inference.

Current local validation showed:

- GGUF dtype/recipe validation passed.
- Loader smoke test passed with routed raw expert binding.
- IQ1_M CPU/reference reader sanity check passed with finite outputs.
- Short prefill cross-entropy (CE) and rank diagnostics produced finite logits, but quality is weaker than the Q2 baseline.

Important caveat: until IQ1_M routed-MoE CUDA/native operators are implemented, runtimes that do not support IQ1_M raw blocks on device may fall back to a very slow CPU/Python reference path.

## Checksums

See `checksums.txt`.
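
One way to verify a download is sketched below. It assumes that `checksums.txt` uses the `sha256sum` format (`<hex digest>  <path>`, with paths relative to the directory containing the file) and that GNU coreutils `sha256sum` is available. Neither assumption is confirmed by this repository, and `verify_checksums` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: verify downloaded files against a checksum list.
# Assumption: the list is in `sha256sum` format ("<hex digest>  <path>") and
# its paths are relative to the directory that contains the list.
verify_checksums() {
  local list="${1:-checksums.txt}"
  # Run in a subshell from the list's directory so relative paths resolve
  # and the caller's working directory is left unchanged.
  ( cd "$(dirname "$list")" && sha256sum -c "$(basename "$list")" )
}
```

Calling `verify_checksums checksums.txt` from the download directory prints one `OK` or `FAILED` line per listed file and returns a non-zero status if any file does not match.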

## License and attribution

This is a derived quantized checkpoint of `deepseek-ai/DeepSeek-V4-Flash`. Please follow the license and usage terms of the original model.