deepseek-v4-toy-int4-ov

WARNING — this is a TOY model with RANDOM weights, not real DeepSeek-V4-Flash.

Architecturally it is the V4-Flash design (hybrid sparse attention, manifold-constrained Hyper-Connections, MoE, indexer-driven KV compression), shrunk to 1.34M parameters with no training. It produces arbitrary tokens when run. The point of this repo is to validate the OpenVINO conversion path, not to do useful inference.

If you're looking for real V4-Flash inference, this is not it.

What this is

A working OpenVINO IR proof-of-concept for the DeepSeek-V4 architecture, built on a 64 GB laptop. It includes both the FP32 IR and an INT4 weight-compressed IR (via nncf.compress_weights). The same conversion code accepts the real V4-Flash weights on any machine with enough RAM to host them.

Source repository: https://github.com/bob798/deepseek-v4-openvino

Hardware requirements

Task	RAM / VRAM
Run this toy IR (CPU, GPU, NPU)	< 100 MB
Convert from PyTorch (toy)	< 1 GB
Convert real V4-Flash from PyTorch	~500 GB peak (BF16 dequant)
Run real V4-Flash IR (after dequant + INT4)	~140 GB

Tested on Intel Core Ultra 9 285H + Arc 140T iGPU with OpenVINO 2026.1.0.
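
As a rough sanity check on the table above, weight storage can be estimated as parameter count times bytes per weight. This is a back-of-envelope sketch, not the author's measurement method: it ignores activations, quantization scales, and any tensors left uncompressed. The parameter counts are the approximate figures from the architecture summary.

```python
# Back-of-envelope weight footprint: parameter count times bytes per weight.
# Ignores activations, quantization scales/zero-points, and any tensors
# that stay uncompressed, so measured numbers will differ.

BYTES_PER_WEIGHT = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_bytes(num_params: float, dtype: str) -> float:
    """Return raw weight storage in bytes for the given dtype."""
    return num_params * BYTES_PER_WEIGHT[dtype]


TOY_PARAMS = 1.34e6   # toy total, from the architecture summary
REAL_PARAMS = 284e9   # approximate V4-Flash total, from the architecture summary

toy_fp32_mb = weight_bytes(TOY_PARAMS, "fp32") / 1e6
real_int4_gb = weight_bytes(REAL_PARAMS, "int4") / 1e9

print(f"toy FP32 weights:      {toy_fp32_mb:.2f} MB")
print(f"V4-Flash INT4 weights: {real_int4_gb:.0f} GB")
```

The INT4 estimate for the full model is in line with the ~140 GB figure in the table. The toy's FP32 estimate is somewhat below the shipped 6.33 MB file, which is expected for an estimate that counts only raw weights.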

Loading via optimum-intel

import torch
from optimum.intel import OVModelForCausalLM

# trust_remote_code=True pulls the bundled configuration_/modeling_ source files.
# use_cache=False because the IR is prefill-only (no past_key_values input).
model = OVModelForCausalLM.from_pretrained(
    "imbob798/deepseek-v4-toy-int4-ov",
    trust_remote_code=True,
    use_cache=False,
)

# No tokenizer is shipped with this toy, so pass random input_ids
# drawn from the 512-entry vocabulary (batch 1, sequence length 64).
input_ids = torch.randint(0, 512, (1, 64))
out = model(input_ids=input_ids)
print(out.logits.shape)   # torch.Size([1, 64, 512])

Loading the raw IR with OpenVINO

import openvino as ov
import numpy as np

core = ov.Core()
compiled = core.compile_model("openvino_model.xml", "CPU")  # or "GPU", "NPU"
input_ids = np.random.randint(0, 512, (1, 64), dtype=np.int64)
logits = compiled([input_ids])[0]
print(logits.shape)   # (1, 64, 512)

Architecture summary

This toy implements every architectural feature of V4-Flash, just at a smaller scale:

Field	Toy value	Real V4-Flash
`hidden_size`	128	4096
`num_hidden_layers`	4	43
`num_attention_heads`	4	64
`num_key_value_heads`	1 (MQA)	1 (MQA)
`head_dim`	32	512
`q_lora_rank`	64	1024
`n_routed_experts`	8	256
`num_experts_per_tok`	2	6
`n_shared_experts`	1	1
`compress_ratios`	`[0, 0, 4, 128]`	`[0, 0, 4, 128, 4, 128, ...]`
`hc_mult`	4	4
`hc_sinkhorn_iters`	4	20
total params	1.34 M	~284 B (~13 B active)

Features exercised by the 4-layer toy:

Layer 0–1: pure sliding-window attention
Layer 2: window + indexer-driven sparse compression (compress_ratio=4)
Layer 3: window + dense compressed-KV (compress_ratio=128)
All layers: 4-way manifold-constrained Hyper-Connections + MoE (8 experts top-2)
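
The layer list above follows directly from the toy's `compress_ratios` value. The sketch below spells out that mapping for the four toy layers only. The description strings are plain labels chosen for illustration, not identifiers from the modeling code.

```python
# Map each layer's compress_ratio to the attention variant listed above.
# Covers only the ratios that appear in the 4-layer toy; the labels are
# descriptive strings, not names taken from modeling_deepseek_v4.py.

TOY_COMPRESS_RATIOS = [0, 0, 4, 128]

ATTENTION_KIND = {
    0: "sliding-window only",
    4: "window + indexer-driven sparse compression",
    128: "window + dense compressed-KV",
}


def describe_layers(compress_ratios):
    """Return (layer_index, ratio, description) for every layer."""
    return [(i, r, ATTENTION_KIND[r]) for i, r in enumerate(compress_ratios)]


for idx, ratio, kind in describe_layers(TOY_COMPRESS_RATIOS):
    print(f"layer {idx}: compress_ratio={ratio:<3d} -> {kind}")
```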

Files

File	Size	Notes
`openvino_model.xml`/`.bin`	2.71 MB	INT4 weight-compressed IR (NNCF, `INT4_ASYM`, group_size=32)
`openvino_model_fp32.xml`/`.bin`	6.33 MB	FP32 baseline IR for comparison
`config.json`	—	toy config with `auto_map`
`configuration_deepseek_v4.py`	—	bundled for `trust_remote_code`
`modeling_deepseek_v4.py`	—	bundled for `trust_remote_code`
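
To give a feel for what `INT4_ASYM` with `group_size=32` means, here is a minimal pure-Python sketch of asymmetric 4-bit group quantization. It is an illustration of the general idea, not NNCF's implementation: each group of 32 weights shares one scale and one offset, and each weight is stored as an integer in 0..15. For simplicity a float offset (the group minimum) stands in for the integer zero-point a real implementation would typically use.

```python
# Illustrative asymmetric 4-bit group quantization in plain Python.
# NOT NNCF's implementation: a float offset replaces the integer zero-point.

import random

GROUP_SIZE = 32
LEVELS = 15  # 4 bits -> integer codes 0..15


def quantize_group(weights):
    """Quantize one group; return (codes, scale, offset)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for a flat group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate weights from 4-bit codes."""
    return [c * scale + offset for c in codes]


def quantize(weights):
    """Split a flat weight list into groups and quantize each one."""
    groups = [weights[i:i + GROUP_SIZE] for i in range(0, len(weights), GROUP_SIZE)]
    return [quantize_group(g) for g in groups]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(128)]
packed = quantize(weights)
restored = [w for codes, scale, offset in packed
            for w in dequantize_group(codes, scale, offset)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(packed)} max_abs_err={max_err:.6f}")
```

The per-weight reconstruction error is bounded by half a quantization step within each group, which is why smaller groups trade extra scale storage for accuracy.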

Numerics

IR	Size	Greedy next-token (B=1, seed=0)
FP32	6.33 MB	78
INT8 (not shipped here)	3.03 MB	78
INT4 (shipped)	2.71 MB	78

INT4 vs. FP32 output on the toy: max abs diff 2.9e-1, mean abs diff 4.3e-2. The greedy token still matches.
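
One plausible way to produce this kind of comparison is to take the element-wise absolute difference between two sets of logits and check whether the argmax agrees. The sketch below assumes that approach; the logit values are made up for illustration and are not taken from the shipped IRs.

```python
# Compare two logit vectors: element-wise absolute difference plus a
# greedy (argmax) token check. The values are hypothetical, not IR output.


def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(reference, candidate):
    """Summarize how far `candidate` drifts from `reference`."""
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "greedy_match": argmax(reference) == argmax(candidate),
    }


fp32_logits = [0.10, 1.90, -0.40, 0.75]  # hypothetical last-position logits
int4_logits = [0.25, 1.70, -0.35, 0.80]  # hypothetical quantized counterpart

report = compare_logits(fp32_logits, int4_logits)
print(report)
```

A matching greedy token with a visible logit difference, as in the table, simply means the quantization error was smaller than the gap between the top two logits.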

Known limitations

Toy weights only. Loading this model and generating text will produce random gibberish. This repo exists to validate the conversion path, not to serve inference.
Prefill-only. The IR has no past_key_values input. Use use_cache=False with OVModelForCausalLM.from_pretrained. Autoregressive decode will require KV-cache plumbing that isn't done yet.
B=2 numerical drift. At batch size 2, a single batch element's greedy token diverges from the PyTorch reference. The cause is floating-point rounding order, not a topology bug.
MTP and hash routing not implemented. Multi-Token Prediction blocks and the hash-routing tables of the first 3 V4-Flash layers are not in the toy (num_nextn_predict_layers=0, num_hash_layers=0).
Not yet registered upstream. As of optimum-intel 1.27.0, native model_type="deepseek_v4" is not registered; this repo uses trust_remote_code=True instead. See the upstream issue tracked at https://github.com/bob798/deepseek-v4-openvino/blob/main/UPSTREAM_ISSUE.md.
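
Regarding the prefill-only limitation: without a KV cache, the only way to generate more than one token is to re-run the full prefill on the growing sequence at every step. The sketch below shows that loop with a dummy model. `forward` and `toy_forward` are hypothetical stand-ins for the compiled IR, and the sketch assumes the IR accepts variable sequence lengths, which this README does not confirm.

```python
# Greedy decoding without a KV cache: re-run the whole prefill each step.
# `forward` is a hypothetical stand-in for the compiled IR; variable
# sequence length support is an assumption, not a documented property.

from typing import Callable, List

Logits = List[List[float]]  # [seq_len][vocab_size] for a single sequence


def greedy_decode(forward: Callable[[List[int]], Logits],
                  prompt: List[int], max_new_tokens: int) -> List[int]:
    """Append the argmax token of the last position, one step at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = forward(tokens)   # cost grows with len(tokens) every step
        last = logits[-1]          # logits at the final position
        tokens.append(max(range(len(last)), key=last.__getitem__))
    return tokens


def toy_forward(tokens: List[int], vocab_size: int = 8) -> Logits:
    """Deterministic dummy model that favors (token + 1) mod vocab_size."""
    return [[1.0 if v == (t + 1) % vocab_size else 0.0 for v in range(vocab_size)]
            for t in tokens]


print(greedy_decode(toy_forward, [3], max_new_tokens=4))  # [3, 4, 5, 6, 7]
```

Each step repeats all earlier work, so the total cost grows roughly quadratically with the number of generated tokens. That repeated work is what KV-cache plumbing would avoid.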

License

MIT. Architecture is derived from the deepseek-ai V4-Flash reference (also MIT) at https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash. The PyTorch port and the OpenVINO conversion code are original.

Citation / attribution

If you build on this work, please link https://github.com/bob798/deepseek-v4-openvino.
