deepseek-v4-toy-int4-ov
WARNING: this is a TOY model with RANDOM weights, not real DeepSeek-V4-Flash.
Architecturally it is the V4-Flash design (hybrid sparse attention, manifold-constrained Hyper-Connections, MoE, indexer-driven KV compression), shrunk to 1.34M parameters with no training. It produces arbitrary tokens when run. The point of this repo is to validate the OpenVINO conversion path, not to do useful inference.
If you're looking for real V4-Flash inference, this is not it.
What this is
A working OpenVINO IR proof-of-concept for the DeepSeek-V4 architecture, built
on a 64 GB laptop. Includes both the FP32 IR and an INT4 weight-compressed IR
(via nncf.compress_weights). The same conversion code accepts the real
V4-Flash weights wherever there is enough RAM to host them.
Source repository: https://github.com/bob798/deepseek-v4-openvino
Hardware requirements
| What | RAM / VRAM |
|---|---|
| Run this toy IR (CPU, GPU, NPU) | < 100 MB |
| Convert from PyTorch (toy) | < 1 GB |
| Convert real V4-Flash from PyTorch | ~500 GB peak (BF16 dequant) |
| Run real V4-Flash IR (after dequant + INT4) | ~140 GB |
Tested on Intel Core Ultra 9 285H + Arc 140T iGPU with OpenVINO 2026.1.0.
Loading via optimum-intel
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
# trust_remote_code=True pulls the bundled configuration_/modeling_ source files.
# use_cache=False because the IR is prefill-only: there is no past_key_values input.
model = OVModelForCausalLM.from_pretrained(
    "imbob798/deepseek-v4-toy-int4-ov",
    trust_remote_code=True,
    use_cache=False,
)
# No tokenizer is shipped with this toy. Pass random input_ids:
import torch
input_ids = torch.randint(0, 512, (1, 64))
out = model(input_ids=input_ids)
print(out.logits.shape) # torch.Size([1, 64, 512])
Loading the raw IR with OpenVINO
import openvino as ov
import numpy as np
core = ov.Core()
compiled = core.compile_model("openvino_model.xml", "CPU") # or "GPU", "NPU"
input_ids = np.random.randint(0, 512, (1, 64), dtype=np.int64)
logits = compiled([input_ids])[0]
print(logits.shape) # (1, 64, 512)
Architecture summary
This toy implements every architectural feature of V4-Flash, just at a smaller scale:
| Field | Toy value | Real V4-Flash |
|---|---|---|
| `hidden_size` | 128 | 4096 |
| `num_hidden_layers` | 4 | 43 |
| `num_attention_heads` | 4 | 64 |
| `num_key_value_heads` | 1 (MQA) | 1 (MQA) |
| `head_dim` | 32 | 512 |
| `q_lora_rank` | 64 | 1024 |
| `n_routed_experts` | 8 | 256 |
| `num_experts_per_tok` | 2 | 6 |
| `n_shared_experts` | 1 | 1 |
| `compress_ratios` | `[0, 0, 4, 128]` | `[0, 0, 4, 128, 4, 128, ...]` |
| `hc_mult` | 4 | 4 |
| `hc_sinkhorn_iters` | 4 | 20 |
| total params | 1.34 M | |
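The `hc_sinkhorn_iters` row can be read as the number of alternating row/column normalization passes applied to each `hc_mult` x `hc_mult` mixing matrix. The sketch below is a plain Sinkhorn loop under that assumption; the function names and the example matrix are illustrative and are not taken from the bundled modeling code, which may work in log space or order the steps differently.

```python
def sinkhorn(matrix, iters):
    """Alternately normalize rows, then columns, of a positive square matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iters):
        # Row step: scale every row so it sums to 1.
        for i in range(n):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        # Column step: scale every column so it sums to 1.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m


def max_row_error(m):
    """Largest deviation of any row sum from 1."""
    return max(abs(sum(row) - 1.0) for row in m)


# Made-up positive 4x4 matrix (hc_mult = 4); values are illustrative only.
mixing = [
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 2.0, 2.0],
    [2.0, 3.0, 1.0, 1.0],
    [1.0, 1.0, 4.0, 2.0],
]

toy = sinkhorn(mixing, iters=4)    # toy setting
full = sinkhorn(mixing, iters=20)  # real V4-Flash setting
print("row error after 4 iters: ", max_row_error(toy))
print("row error after 20 iters:", max_row_error(full))
```

Because each pass ends with the column step, column sums are exact and the remaining error shows up in the row sums; more iterations push the matrix closer to doubly stochastic.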
Features exercised by the 4-layer toy:
- Layers 0-1: pure sliding-window attention
- Layer 2: window + indexer-driven sparse compression (compress_ratio=4)
- Layer 3: window + dense compressed-KV (compress_ratio=128)
- All layers: 4-way manifold-constrained Hyper-Connections + MoE (8 experts top-2)
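The per-layer behaviour listed above follows directly from `compress_ratios`. A minimal sketch of that lookup, using only the three ratio values this card mentions; the helper name `attention_kind` is hypothetical and the real modeling code may dispatch differently:

```python
def attention_kind(compress_ratio):
    """Map one compress_ratios entry to the attention variant named in this card."""
    if compress_ratio == 0:
        return "sliding-window only"
    if compress_ratio == 4:
        return "window + indexer-driven sparse compression"
    if compress_ratio == 128:
        return "window + dense compressed KV"
    raise ValueError(f"ratio {compress_ratio} is not described in this card")


toy_compress_ratios = [0, 0, 4, 128]  # from the toy config
plan = [attention_kind(r) for r in toy_compress_ratios]

for layer, kind in enumerate(plan):
    print(f"layer {layer}: {kind}")
```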
Files
| File | Size | Notes |
|---|---|---|
| `openvino_model.xml/.bin` | 2.71 MB | INT4 weight-compressed IR (NNCF, INT4_ASYM, group_size=32) |
| `openvino_model_fp32.xml/.bin` | 6.33 MB | FP32 baseline IR for comparison |
| `config.json` | - | toy config with auto_map |
| `configuration_deepseek_v4.py` | - | bundled for trust_remote_code |
| `modeling_deepseek_v4.py` | - | bundled for trust_remote_code |
Numerics
| IR | Size | Greedy next-token (B=1, seed=0) |
|---|---|---|
| FP32 | 6.33 MB | 78 |
| INT8 (not shipped here) | 3.03 MB | 78 |
| INT4 (shipped) | 2.71 MB | 78 |
INT4 vs FP32: max abs diff 2.9e-1, mean 4.3e-2 on the toy. Greedy token matches.
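A comparison like the one above can be reproduced with a few lines once both IRs have produced logits for the same input. This sketch assumes the reported differences are taken over the logits of one position; the two vectors are made-up placeholders, not outputs of the shipped models.

```python
def argmax(values):
    """Index of the largest value (the greedy token id)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(reference, candidate):
    """Summarize how far a compressed model's logits drift from the baseline."""
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "greedy_match": argmax(reference) == argmax(candidate),
    }


# Placeholder logits for a 4-token vocabulary (illustrative values only).
fp32_logits = [0.10, 2.40, -1.30, 0.75]
int4_logits = [0.25, 2.15, -1.20, 0.80]

report = compare_logits(fp32_logits, int4_logits)
print(report)
```

With real outputs, `reference` and `candidate` would be the last-position rows of the FP32 and INT4 logits, flattened to Python lists.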
Known limitations
- Toy weights only. Loading this model and generating text will produce random gibberish. This repo exists to validate the conversion path, not to serve inference.
- Prefill-only. The IR has no `past_key_values` input. Use `use_cache=False` with `OVModelForCausalLM.from_pretrained`. Autoregressive decode will require KV-cache plumbing that isn't done yet.
- B=2 numerical drift. With batch size 2, one batch element's greedy token diverges from PyTorch. The cause is floating-point rounding order, not a topology bug.
- MTP and hash routing not implemented. Multi-Token Prediction blocks and the hash-routing tables of the first 3 V4-Flash layers are not in the toy (`num_nextn_predict_layers=0`, `num_hash_layers=0`).
- Not yet registered upstream. As of `optimum-intel` 1.27.0, native `model_type="deepseek_v4"` is not registered; this repo uses `trust_remote_code=True` instead. See the upstream issue tracked at https://github.com/bob798/deepseek-v4-openvino/blob/main/UPSTREAM_ISSUE.md.
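Until KV-cache support exists, greedy decoding can in principle be emulated by re-running prefill on the growing sequence. The sketch below shows that loop with a stand-in `fake_forward`; it assumes the IR accepts variable sequence lengths, which this card does not confirm, and the cost grows quadratically with the number of generated tokens.

```python
def greedy_decode(forward, prompt_ids, max_new_tokens):
    """Greedy decoding without a KV cache: re-run the full sequence every step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)  # one row of logits per input position
        last = logits[-1]      # only the final position predicts the next token
        next_id = max(range(len(last)), key=last.__getitem__)
        ids.append(next_id)
    return ids


def fake_forward(ids, vocab_size=512):
    """Hypothetical stand-in for the compiled model: always favors token + 1."""
    rows = []
    for token in ids:
        row = [0.0] * vocab_size
        row[(token + 1) % vocab_size] = 1.0
        rows.append(row)
    return rows


print(greedy_decode(fake_forward, [5, 6], max_new_tokens=3))  # [5, 6, 7, 8, 9]
```

With the real IR, `forward` would wrap the compiled model call and convert its output to per-position rows; with random weights the generated ids are arbitrary either way.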
License
MIT. Architecture is derived from the deepseek-ai V4-Flash reference (also MIT) at https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash. The PyTorch port and the OpenVINO conversion code are original.
Citation / attribution
If you build on this work, please link https://github.com/bob798/deepseek-v4-openvino.