Optimum Intel 2.0: An OpenVINO-First Toolkit for Running Open Models on Intel
Here's everything that's new and what it means in practice.
The headline: one library, one path
Optimum Intel started life as an umbrella over several Intel optimization backends. Over time, the OpenVINO path became the one most people actually used for deployment — for exporting models, compressing them, and running them across Intel CPUs, GPUs, and NPUs. Version 2.0 leans into that reality.
Three breaking changes define the new direction:
- Intel Neural Compressor (INC) and Intel Extension for PyTorch (IPEX) integrations have been removed. Both were deprecated in v1.27.0. If you depend on either, stay on the v1.27 line.
- The ONNX dependency is gone from the package requirements.
- OpenVINO and NNCF are now installed by default. The [openvino] and [nncf] extras are deprecated, so you no longer need to remember which extras to add.
The net effect is a smaller, cleaner package and a single, obvious way to do things. Installation is now just:
pip install --upgrade optimum-intel
That one command gives you everything you need to export, quantize, and run models with OpenVINO. No extras to juggle, no surprises.
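To confirm what the install actually pulled in, a small check like the one below can help. It is a minimal sketch that only uses the standard library; the distribution names `optimum-intel`, `openvino`, and `nncf` are the PyPI names, and the helper functions are illustrative, not part of Optimum Intel.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(dist_names=("optimum-intel", "openvino", "nncf")):
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in dist_names}


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name}: {version or 'not installed'}")
```

After a successful upgrade, all three entries should report a version rather than "not installed".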
Quickstart
Export a model from the Hub to the OpenVINO IR format with the CLI:
optimum-cli export openvino \
--model Qwen/Qwen2.5-7B-Instruct \
ov_qwen2.5_7b_instruct
Then load it and run inference with a drop-in replacement for the familiar Transformers classes — swap AutoModelForCausalLM for OVModelForCausalLM:
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline
model_id = "ov_qwen2.5_7b_instruct"
model = OVModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Explain mixture-of-experts in one sentence:", max_new_tokens=100))
The same OVModelForXxx pattern extends across LLMs, vision-language models, speech, and diffusion pipelines.
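Because the same exported model can run on a CPU, GPU, or NPU, it can be handy to pick the device at load time. The sketch below is one way to do that, under two assumptions: that `openvino.Core().available_devices` lists names such as "CPU", "GPU.0", or "NPU", and that `from_pretrained` accepts a `device` keyword as it did in earlier Optimum Intel releases. `pick_device` and `load_on_best_device` are hypothetical helpers.

```python
def pick_device(available, preference=("GPU", "CPU")):
    """Return the first preferred device family present in `available`.

    A reported name matches a family if it equals it ("GPU") or is a
    numbered instance of it ("GPU.0", "GPU.1").
    """
    for family in preference:
        for name in available:
            if name == family or name.startswith(family + "."):
                return family
    return "CPU"  # conservative fallback when nothing preferred is found


def load_on_best_device(model_dir, preference=("GPU", "CPU")):
    """Load an exported causal LM on the preferred available device."""
    # Imported lazily so pick_device works without OpenVINO installed.
    import openvino as ov
    from optimum.intel import OVModelForCausalLM

    device = pick_device(ov.Core().available_devices, preference)
    # Assumption: the `device` keyword is still supported in 2.0.
    return OVModelForCausalLM.from_pretrained(model_dir, device=device)
```

Calling `load_on_best_device("ov_qwen2.5_7b_instruct")` would then load the model exported above on a GPU if one is visible and fall back to the CPU otherwise.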
Day-one support for the newest open models
Optimum Intel 2.0 adds OpenVINO support for a wide range of recently released architectures, so you can take state-of-the-art models local without waiting:
- Gemma 4
- Qwen3.5, Qwen3.5-MoE, and Qwen3.6
- Qwen3-VL (vision-language) and Qwen3-next (hybrid SSM/attention)
- Qwen3-ASR for speech recognition
- Arcee Trinity (AFMoE)
- LFM2-MoE
- Kokoro TTS for text-to-speech
- Command-R family (via tiny-aya-base) and HY-MT1.5-1.8B
- VideoChat for video understanding
This breadth — text generation, MoE, vision-language, ASR, TTS, and video — reflects where open models are heading, and it's all runnable on Intel hardware through a single API.
Smarter quantization and compression
Compression is where a lot of the value lives when you're targeting laptops, edge devices, and cost-sensitive inference. Version 2.0 sharpens the quantization story, powered by NNCF:
- Data-aware AWQ, with a tuned configuration for Qwen3-30B, for higher-quality low-bit weights.
- Default 8-bit quantization configs with a configurable dynamic quantization group size, giving you a sensible starting point that's still tunable.
- Richer calibration: datasets can now be specified with parameters inline, for example wikitext2:seq_len=128, making it easier to control how calibration data is collected.
- A batch of correctness fixes, including the quantized-model save path and calibration data collection.
Compressing a model to 4-bit weights is a single flag at export time:
optimum-cli export openvino \
--model Qwen/Qwen2.5-7B-Instruct \
--weight-format int4 \
ov_qwen2.5_7b_instruct_int4
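The same export can be combined with data-aware AWQ and an inline calibration dataset. The sketch below assembles such a command from Python. It assumes the `--awq` and `--dataset` flags of `optimum-cli export openvino` are unchanged in 2.0; the `build_export_command` helper and the output directory name are hypothetical.

```python
import shlex


def build_export_command(model, output_dir, weight_format=None,
                         awq=False, dataset=None):
    """Assemble an `optimum-cli export openvino` command as an argv list."""
    cmd = ["optimum-cli", "export", "openvino", "--model", model]
    if weight_format:
        cmd += ["--weight-format", weight_format]
    if awq:
        # This sketch targets the data-aware variant, so it insists on a dataset.
        if not dataset:
            raise ValueError("data-aware AWQ needs a calibration dataset")
        cmd.append("--awq")
    if dataset:
        cmd += ["--dataset", dataset]
    cmd.append(output_dir)
    return cmd


if __name__ == "__main__":
    cmd = build_export_command(
        "Qwen/Qwen2.5-7B-Instruct",
        "ov_qwen2.5_7b_instruct_int4_awq",  # hypothetical output directory
        weight_format="int4",
        awq=True,
        dataset="wikitext2:seq_len=128",  # inline dataset parameter from the notes
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to run it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids shell-quoting problems when the dataset spec contains characters such as `:` or `=`.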
Inference and architecture upgrades
Beyond models and compression, 2.0 brings the runtime in line with the latest stack and modern architectures:
- Transformers v5 compatibility, so you can stay current with the Hugging Face ecosystem (the release supports transformers >= 4.45, < 5.1).
- Eagle3 speculative decoding draft-model support, for faster generation.
- Proper support for hybrid and recurrent architectures: past_key_values handling for stateful inference on hybrid-attention models, and beam_idx wired through linear-attention layers (CausalConv1D, SSM, GDN) so beam search behaves correctly with recurrent models.
- Long-context fixes for Phi-3.5 and Phi-4, plus MoE patching improvements and better mixed-type input handling.
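The supported Transformers range quoted above (>= 4.45, < 5.1) can be checked before upgrading. The following is a rough sketch that compares only the major and minor components, so it ignores pre-release ordering subtleties; `in_supported_range` is a hypothetical helper, not part of either library.

```python
import re


def in_supported_range(version, lower=(4, 45), upper=(5, 1)):
    """True if the (major, minor) part of `version` satisfies lower <= v < upper."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return lower <= major_minor < upper


if __name__ == "__main__":
    for candidate in ("4.44.2", "4.45.0", "5.0.0", "5.1.0"):
        print(candidate, in_supported_range(candidate))
```

In a real environment you would pass `transformers.__version__` instead of the sample strings.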
These are the kinds of changes that make newer model families — MoE, SSM/attention hybrids, long-context — actually work well in production rather than just technically load.
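To see why beam_idx matters for recurrent layers, consider what happens after each beam search step: surviving hypotheses may descend from different parent beams, so any per-beam cached state has to be gathered in the same order. The toy example below illustrates that gather with plain Python lists. It is a conceptual sketch only, not how OpenVINO or Optimum Intel implement it.

```python
def reorder_state(state, beam_idx):
    """Gather per-beam state rows so row i holds the state of beam beam_idx[i]."""
    return [list(state[src]) for src in beam_idx]


# Toy recurrent state: one row per beam (3 beams, 2 state values each).
state = [[0.1, 0.2], [1.1, 1.2], [2.1, 2.2]]

# After a search step, suppose the best continuations descend from
# beams 2, 2, and 0; beam 1 has been dropped.
beam_idx = [2, 2, 0]

state = reorder_state(state, beam_idx)
print(state)  # [[2.1, 2.2], [2.1, 2.2], [0.1, 0.2]]
```

If this gather is skipped for a recurrent layer, each hypothesis continues from some other beam's state, which is the kind of silent error the beam_idx wiring is meant to prevent.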
Who this is for
Optimum Intel 2.0 is most useful if you are:
- Deploying open models on Intel hardware — Xeon and Core CPUs, Arc GPUs, or the NPU on Core Ultra AI PCs.
- Building on-device or edge AI, where a small install footprint and aggressive quantization matter.
- Quantizing LLMs and VLMs for efficient, lower-cost inference, and looking for a maintained, OpenVINO-native path for INT8/INT4/AWQ.
- Working across modalities (text, vision-language, speech, and video) and looking for one consistent API for all of them.
If you've been waiting for a clean, focused, OpenVINO-native way to take the newest open models local on Intel silicon, this is the release to upgrade to.
A note on migration
The removal of INC and IPEX is the one thing to plan around. If your workflow depends on those integrations, pin to the v1.27 line, which remains available. For everyone using the OpenVINO path — the large majority — upgrading is straightforward, and in most cases your install command actually gets simpler.
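If you are unsure whether a codebase still relies on the removed integrations, a quick scan can help. The sketch below flags identifiers that look like INC or IPEX classes in any source that mentions `optimum.intel`. The name-prefix heuristic is an assumption based on class names such as `INCQuantizer` and `IPEXModelForCausalLM` from earlier releases, and `find_removed_names` is a hypothetical helper.

```python
import re

# Assumption: removed classes start with INC or IPEX followed by a CamelCase
# word (e.g. INCQuantizer), which excludes all-caps names like INCLUDE_DIR.
REMOVED_NAME = re.compile(r"\b(?:INC|IPEX)[A-Z][a-z]\w*")


def find_removed_names(source):
    """Return sorted INC*/IPEX* identifiers in code that uses optimum.intel."""
    if "optimum.intel" not in source:
        return []
    return sorted(set(REMOVED_NAME.findall(source)))


sample = """
from optimum.intel import (
    OVModelForCausalLM,
    INCQuantizer,
    IPEXModelForCausalLM,
)
"""
print(find_removed_names(sample))  # ['INCQuantizer', 'IPEXModelForCausalLM']
```

Any file that produces a non-empty result is a candidate for pinning to the v1.27 line or porting to the OpenVINO classes.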
Get started
pip install --upgrade optimum-intel
- Release notes: https://github.com/huggingface/optimum-intel/releases/tag/v2.0.0
- Repository: https://github.com/huggingface/optimum-intel
- Pre-converted OpenVINO models on the Hub: https://huggingface.co/OpenVINO
- Documentation and examples are linked from the repo README.
Try it with your favorite open model, export to OpenVINO, quantize to INT4, and see how it runs on your Intel hardware. We'd love to hear what you build.

