---
license: mit
base_model: deepseek-ai/DeepSeek-V4-Flash
library_name: gguf
pipeline_tag: text-generation
tags:
- gguf
- deepseek-v4-flash
- moe
- quantization
- iq1_m
- dynamic-quantization
---

# DeepSeek-V4-Flash Dynamic IQ1_M GGUF

This repository contains a GGUF quantized checkpoint for **DeepSeek-V4-Flash** using a dynamic routed-MoE IQ1_M recipe.

## Files

- `dsv4-dynamic-iq1m-antirez.gguf`: full dynamic IQ1_M GGUF checkpoint.
- `metadata/dsv4_dynamic_iq1m_complete.tensor_types.txt`: exact tensor-type recipe used for quantization.
- `checksums.txt`: SHA256 checksums for the uploaded artifact and metadata.
- `logs/quantize_antirez_q8base.log`: final quantization log.
- `logs/quantize_antirez_dryrun_q8base.log`: dry-run quantization log.

The imatrix/calibration file is intentionally **not** included in this upload.

## Quantization recipe

The routed expert recipe is:

- `ffn_gate_exps`: all 43 layers -> `iq1_m`
- `ffn_up_exps`: all 43 layers -> `iq1_m`
- `ffn_down_exps`: layers 0-5 -> `q2_k`
- `ffn_down_exps`: layers 6-42 -> `iq1_m`
- non-routed tensors are kept according to the complete tensor-type recipe.

The final validated dtype distribution was:

```text
f16: 359 tensors
f32: 492 tensors
i32: 3 tensors
iq1_m: 123 tensors
q2_k: 6 tensors
q8_0: 345 tensors
```

Quantization was produced from a full routed-F16 GGUF source using an antirez imatrix and llama.cpp `llama-quantize`. The final command used `Q8_0` as the positional base type only to activate complete per-tensor overrides:

```bash
llama-quantize \
  --imatrix .tmp/DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-1p5m.dat \
  --tensor-type-file dsv4_dynamic_iq1m_complete.tensor_types.txt \
  .tmp/dsv4-source-f16-full-routed.gguf \
  .tmp/dsv4-dynamic-iq1m-antirez.gguf \
  Q8_0 32
```

## Runtime status

This checkpoint is intended for ongoing development of device-side IQ1_M routed-MoE inference. Current local validation showed:

- GGUF dtype/recipe validation passed.
- Loader smoke passed with routed raw expert binding.
- IQ1_M CPU/reference reader sanity passed with finite outputs.
- Short prefill CE/rank diagnostics produced finite logits, but quality is weaker than the Q2 baseline.

Important caveat: until IQ1_M routed-MoE CUDA/native operators are implemented, runtimes that do not support IQ1_M raw blocks on device may fall back to a very slow CPU/Python reference path.

## Checksums

See `checksums.txt`.

## License and attribution

This is a derived quantized checkpoint of `deepseek-ai/DeepSeek-V4-Flash`. Please follow the license and usage terms of the original model.
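
As a supplement to the Checksums section, the following is a minimal sketch of how a download could be checked against `checksums.txt`. It assumes the file uses the common `sha256sum` layout (`<hex digest>  <filename>` per line), which this card does not confirm. The helper names `sha256_of`, `parse_checksums`, and `verify` are hypothetical and are not part of this repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse `<hex digest>  <filename>` lines (ASSUMED sha256sum layout)."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        digest, _, name = line.partition(" ")
        # sha256sum marks binary mode with a leading '*' on the filename.
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(checksums_file, root="."):
    """Map each listed file found under `root` to True (match) or False (mismatch).

    Files listed in the checksum file but absent under `root` are skipped.
    """
    expected = parse_checksums(Path(checksums_file).read_text())
    results = {}
    for name, digest in expected.items():
        target = Path(root) / name
        if target.is_file():
            results[name] = sha256_of(target) == digest
    return results
```

Calling `verify("checksums.txt", root=".")` from the directory holding the downloaded files returns a dictionary keyed by filename. Any `False` entry suggests a corrupted or incomplete download. If the layout assumption holds, running `sha256sum -c checksums.txt` on Linux performs the same check.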