deepseek-v4-toy-int4-ov

WARNING: this is a TOY model with RANDOM weights, not the real DeepSeek-V4-Flash.

Architecturally it is the V4-Flash design (hybrid sparse attention, manifold-constrained Hyper-Connections, MoE, indexer-driven KV compression), shrunk to 1.34M parameters with no training. It produces arbitrary tokens when run. The point of this repo is to validate the OpenVINO conversion path, not to do useful inference.

If you're looking for real V4-Flash inference, this is not it.

What this is

A working OpenVINO IR proof-of-concept for the DeepSeek-V4 architecture, built on a 64 GB laptop. Includes both the FP32 IR and an INT4 weight-compressed IR (via nncf.compress_weights). The same conversion code accepts the real V4-Flash weights wherever there is enough RAM to host them.

Source repository: https://github.com/bob798/deepseek-v4-openvino

Hardware requirements

Task                                          RAM / VRAM
Run this toy IR (CPU, GPU, NPU)               < 100 MB
Convert from PyTorch (toy)                    < 1 GB
Convert real V4-Flash from PyTorch            ~500 GB peak (BF16 dequant)
Run real V4-Flash IR (after dequant + INT4)   ~140 GB
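As a rough sanity check on the real-model rows, the weight footprint can be estimated from the parameter count alone. The sketch below takes the 284 B total from the architecture summary and assumes a flat bytes-per-weight figure for each format; it ignores quantization scales and zero-points, activations, and any layers kept at higher precision, so it is a back-of-the-envelope estimate rather than a measurement.

```python
# Weight-only memory estimate. TOTAL_PARAMS comes from the architecture
# table; the bytes-per-weight values are simplifying assumptions.
TOTAL_PARAMS = 284e9

BYTES_PER_PARAM = {
    "BF16": 2.0,  # 16 bits per weight
    "INT8": 1.0,  # 8 bits per weight
    "INT4": 0.5,  # 4 bits per weight, ignoring scales and zero-points
}


def weight_gb(n_params: float, dtype: str) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_gb(TOTAL_PARAMS, dtype):.0f} GB")
```

The INT4 estimate (about 142 GB) lands close to the ~140 GB row. The BF16 estimate (about 568 GB) is in the same range as the ~500 GB peak but does not match it exactly; the table's figure presumably reflects conversion details not stated here.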

Tested on Intel Core Ultra 9 285H + Arc 140T iGPU with OpenVINO 2026.1.0.

Loading via optimum-intel

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

# trust_remote_code=True pulls the bundled configuration_/modeling_ source files.
# use_cache=False because the IR is prefill-only: it has no past_key_values input.
model = OVModelForCausalLM.from_pretrained(
    "imbob798/deepseek-v4-toy-int4-ov",
    trust_remote_code=True,
    use_cache=False,
)

# No tokenizer is shipped with this toy. Pass random input_ids:
import torch
input_ids = torch.randint(0, 512, (1, 64))
out = model(input_ids=input_ids)
print(out.logits.shape)   # torch.Size([1, 64, 512])

Loading the raw IR with OpenVINO

import openvino as ov
import numpy as np

core = ov.Core()
compiled = core.compile_model("openvino_model.xml", "CPU")  # or "GPU", "NPU"
input_ids = np.random.randint(0, 512, (1, 64), dtype=np.int64)
logits = compiled([input_ids])[0]
print(logits.shape)   # (1, 64, 512)

Architecture summary

This toy implements every architectural feature of V4-Flash, just at a smaller scale:

Field                 Toy value        Real V4-Flash
hidden_size           128              4096
num_hidden_layers     4                43
num_attention_heads   4                64
num_key_value_heads   1 (MQA)          1 (MQA)
head_dim              32               512
q_lora_rank           64               1024
n_routed_experts      8                256
num_experts_per_tok   2                6
n_shared_experts      1                1
compress_ratios       [0, 0, 4, 128]   [0, 0, 4, 128, 4, 128, ...]
hc_mult               4                4
hc_sinkhorn_iters     4                20
total params          1.34 M           284 B (13 B active)
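To give a feel for what the hc_sinkhorn_iters row controls, here is a minimal sketch of Sinkhorn-Knopp normalization. It assumes the setting counts alternating row and column normalization passes over an hc_mult x hc_mult (4x4) mixing matrix. This README does not spell out how V4-Flash applies the procedure, so the random starting matrix and the function names are illustrative only.

```python
import random


def sinkhorn(matrix, iters):
    """Alternately normalize rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iters):
        # Row pass: scale every row so it sums to 1.
        for i in range(n):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        # Column pass: scale every column so it sums to 1.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m


def max_row_error(m):
    """Largest deviation of any row sum from 1 (columns are exact after the last pass)."""
    return max(abs(sum(row) - 1.0) for row in m)


rng = random.Random(0)
start = [[rng.uniform(0.1, 1.0) for _ in range(4)] for _ in range(4)]

for iters in (4, 20):  # toy value vs. real V4-Flash value from the table
    print(iters, max_row_error(sinkhorn(start, iters)))
```

More passes bring the matrix closer to doubly stochastic (rows and columns both summing to 1), which is one plausible reason for the toy to use fewer iterations than the full model.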

Features exercised by the 4-layer toy:

  • Layer 0–1: pure sliding-window attention
  • Layer 2: window + indexer-driven sparse compression (compress_ratio=4)
  • Layer 3: window + dense compressed-KV (compress_ratio=128)
  • All layers: 4-way manifold-constrained Hyper-Connections + MoE (8 experts top-2)
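The per-layer behavior listed above can be read straight off compress_ratios. The following sketch encodes that mapping as inferred from this list; the function name and the returned labels are hypothetical, and the bundled modeling_deepseek_v4.py may organize this logic differently.

```python
TOY_COMPRESS_RATIOS = [0, 0, 4, 128]  # toy value from the architecture table


def attention_kind(compress_ratio: int) -> str:
    """Map a compress_ratio entry to the attention variant described above."""
    if compress_ratio == 0:
        return "sliding-window only"
    if compress_ratio == 4:
        return "window + indexer-driven sparse compression"
    if compress_ratio == 128:
        return "window + dense compressed KV"
    raise ValueError(f"unexpected compress_ratio: {compress_ratio}")


for layer, ratio in enumerate(TOY_COMPRESS_RATIOS):
    print(f"layer {layer}: compress_ratio={ratio} -> {attention_kind(ratio)}")
```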

Files

File                           Size      Notes
openvino_model.xml/.bin        2.71 MB   INT4 weight-compressed IR (NNCF, INT4_ASYM, group_size=32)
openvino_model_fp32.xml/.bin   6.33 MB   FP32 baseline IR for comparison
config.json                    -         toy config with auto_map
configuration_deepseek_v4.py   -         bundled for trust_remote_code
modeling_deepseek_v4.py        -         bundled for trust_remote_code
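For readers unfamiliar with the INT4_ASYM, group_size=32 setting, the sketch below shows the general idea of group-wise asymmetric 4-bit quantization in plain Python. It is a simplified illustration, not NNCF's implementation: it stores each group's floating-point minimum as the offset instead of an integer zero-point, and all function names are made up for this example.

```python
import random

LEVELS = 15  # 4 bits give integer codes 0..15


def quantize_group(group):
    """Quantize one group to 4-bit codes with its own scale and offset."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for constant groups
    codes = [min(LEVELS, max(0, round((w - lo) / scale))) for w in group]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map 4-bit codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize(weights, group_size=32):
    """Split a flat weight list into groups and quantize each independently."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(128)]
groups = quantize(weights)
restored = [w for g in groups for w in dequantize_group(*g)]
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(groups)} groups, worst-case error {worst:.2e}")
```

Each group carries its own scale and offset, so the rounding error inside a group is bounded by half that group's scale. Smaller groups track local weight ranges more closely at the cost of more metadata.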

Numerics

IR                        Size      Greedy next-token ID (B=1, seed=0)
FP32                      6.33 MB   78
INT8 (not shipped here)   3.03 MB   78
INT4 (shipped)            2.71 MB   78

INT4 vs FP32: max abs diff 2.9e-1, mean 4.3e-2 on the toy. Greedy token matches.
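A comparison of this kind can be sketched as follows, assuming that "max abs diff" and "mean" refer to element-wise absolute differences between logits and that the greedy token is the argmax at the last position. The helper names and the tiny example vectors are illustrative and are not the values measured on the toy.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(reference, candidate):
    """Summarize how far candidate logits drift from the reference logits."""
    diffs = [abs(a - b) for a, b in zip(reference, candidate)]
    return {
        "max_abs": max(diffs),
        "mean_abs": sum(diffs) / len(diffs),
        "greedy_match": argmax(reference) == argmax(candidate),
    }


# Made-up last-position logits for a 3-token vocabulary.
fp32_logits = [0.1, 2.0, -1.0]
int4_logits = [0.3, 1.8, -1.1]
print(compare_logits(fp32_logits, int4_logits))
```

A matching greedy token only says the top-1 choice survived quantization for that input; the max and mean differences give a better picture of overall drift.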

Known limitations

  • Toy weights only. Loading this model and generating text will produce random gibberish. This repo exists to validate the conversion path, not to serve inference.
  • Prefill-only. The IR has no past_key_values input, so pass use_cache=False to OVModelForCausalLM.from_pretrained. Autoregressive decoding needs KV-cache plumbing that is not implemented yet.
  • B=2 numerical drift. At batch size 2, one batch element's greedy token diverges from the PyTorch reference. The cause is floating-point rounding order, not a topology bug.
  • MTP and hash routing not implemented. Multi-Token Prediction blocks and the hash-routing tables of the first 3 V4-Flash layers are not in the toy (num_nextn_predict_layers=0, num_hash_layers=0).
  • Not yet registered upstream. As of optimum-intel 1.27.0, native model_type="deepseek_v4" is not registered; this repo uses trust_remote_code=True instead. See the upstream issue tracked at https://github.com/bob798/deepseek-v4-openvino/blob/main/UPSTREAM_ISSUE.md.
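To illustrate what the prefill-only limitation means in practice, here is a sketch of a naive greedy loop that re-runs the full forward pass over the whole sequence for every new token. The forward callable is a stand-in for the compiled IR. Whether the shipped IR accepts growing sequence lengths is an assumption this README does not confirm, and with random weights the output would be meaningless anyway.

```python
from typing import Callable, List

Logits = List[List[float]]  # shape: [seq_len][vocab_size]


def greedy_generate(
    forward: Callable[[List[int]], Logits],
    prompt: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Greedy decoding without a KV cache: recompute everything each step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = forward(tokens)  # full prefill over all tokens so far
        last = logits[-1]         # logits for the final position only
        next_token = max(range(len(last)), key=last.__getitem__)
        tokens.append(next_token)
    return tokens


def fake_forward(tokens: List[int]) -> Logits:
    """Hypothetical stand-in model: always predicts (token + 1) mod 8."""
    vocab = 8
    return [
        [1.0 if v == (t + 1) % vocab else 0.0 for v in range(vocab)]
        for t in tokens
    ]


print(greedy_generate(fake_forward, [3], 4))  # [3, 4, 5, 6, 7]
```

Each step reprocesses every earlier token, so the cost grows quadratically with sequence length. That is the work a KV cache would avoid.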

License

MIT. Architecture is derived from the deepseek-ai V4-Flash reference (also MIT) at https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash. The PyTorch port and the OpenVINO conversion code are original.

Citation / attribution

If you build on this work, please link https://github.com/bob798/deepseek-v4-openvino.
