Optimum Intel 2.0: An OpenVINO-First Toolkit for Running Open Models on Intel

Community Article

Published June 11, 2026


Today we're releasing Optimum Intel 2.0, the biggest update to the library since its inception. This is a focusing release: Optimum Intel is now an OpenVINO-first toolkit, with a streamlined install, day-one support for the latest open models, and meaningful upgrades to quantization and inference. If your goal is to take the newest models from the Hugging Face Hub and run them efficiently on Intel CPUs, Arc GPUs, and Core Ultra NPUs, this release is for you.

Here's everything that's new and what it means in practice.

The headline: one library, one path

Optimum Intel started life as an umbrella over several Intel optimization backends. Over time, the OpenVINO path became the one most people actually used for deployment — for exporting models, compressing them, and running them across Intel CPUs, GPUs, and NPUs. Version 2.0 leans into that reality.

Three breaking changes define the new direction:

Intel Neural Compressor (INC) and Intel Extension for PyTorch (IPEX) integrations have been removed. Both were deprecated in v1.27.0. If you depend on either, stay on the v1.27 line.
The ONNX dependency is gone from the package requirements.
OpenVINO and NNCF are now installed by default. The [openvino] and [nncf] extras are deprecated — you no longer need to remember which extras to add.

The net effect is a smaller, cleaner package and a single, obvious way to do things. Installation is now just:

pip install --upgrade optimum-intel

That one command gives you everything you need to export, quantize, and run models with OpenVINO. No extras to juggle, no surprises.
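To confirm that the single install pulled in the whole stack, a short check like the one below can help. It is a minimal sketch that uses only the standard library; the helper name `installed_versions` is made up for illustration, and the three distribution names are the PyPI package names mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("optimum-intel", "openvino", "nncf")):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If any of the three shows up as not installed after upgrading, the environment is probably still resolving an older 1.x release.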

Quickstart

Export a model from the Hub to the OpenVINO IR format with the CLI:

optimum-cli export openvino \
  --model Qwen/Qwen2.5-7B-Instruct \
  ov_qwen2.5_7b_instruct

Then load it and run inference with a drop-in replacement for the familiar Transformers classes — swap AutoModelForCausalLM for OVModelForCausalLM:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

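# Local directory produced by the export command above
model_id = "ov_qwen2.5_7b_instruct"
model = OVModelForCausalLM.from_pretrained(model_id)
# The tokenizer is loaded from the same exported directory
tokenizer = AutoTokenizer.from_pretrained(model_id)

# OVModelForCausalLM plugs into the standard Transformers pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Explain mixture-of-experts in one sentence:", max_new_tokens=100))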

The same OVModelForXxx pattern extends across LLMs, vision-language models, speech, and diffusion pipelines.
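The quickstart runs on whatever device the runtime picks by default. To target a GPU when one is present, something like the following sketch may work. It assumes that the `device` keyword accepted by `from_pretrained` in the 1.x releases is unchanged in 2.0, and `pick_device` and `load_on_best_device` are hypothetical helpers written for this example, not part of the library.

```python
def pick_device(available, preferred=("GPU", "CPU")):
    """Return the first available device whose base name matches the preference order."""
    for want in preferred:
        for name in available:
            # OpenVINO can report indexed names such as "GPU.0"; compare the base name.
            if name.split(".")[0] == want:
                return name
    raise RuntimeError(f"None of {preferred} found in {list(available)}")


def load_on_best_device(model_dir):
    """Load an exported model on the preferred device (hypothetical helper)."""
    # Imported lazily so pick_device() is usable without the libraries installed.
    import openvino as ov
    from optimum.intel import OVModelForCausalLM

    device = pick_device(ov.Core().available_devices)
    # Assumption: the `device` keyword from 1.x still exists in 2.0.
    return OVModelForCausalLM.from_pretrained(model_dir, device=device)
```

Calling `load_on_best_device("ov_qwen2.5_7b_instruct")` would then load the exported model on a GPU if OpenVINO reports one, and fall back to the CPU otherwise. The preference order is a design choice; pass `preferred=("NPU", "CPU")` to `pick_device` to try the NPU first.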

Day-one support for the newest open models

Optimum Intel 2.0 adds OpenVINO support for a wide range of recently released architectures, so you can take state-of-the-art models local without waiting:

Gemma 4
Qwen3.5, Qwen3.5-MoE, and Qwen3.6
Qwen3-VL (vision-language) and Qwen3-next (hybrid SSM/attention)
Qwen3-ASR for speech recognition
Arcee Trinity (AFMoE)
LFM2-MoE
Kokoro TTS for text-to-speech
Command-R family (via tiny-aya-base) and HY-MT1.5-1.8B
VideoChat for video understanding

This breadth — text generation, MoE, vision-language, ASR, TTS, and video — reflects where open models are heading, and it's all runnable on Intel hardware through a single API.

Smarter quantization and compression

Compression is where a lot of the value lives when you're targeting laptops, edge devices, and cost-sensitive inference. Version 2.0 sharpens the quantization story, powered by NNCF:

Data-aware AWQ, with a tuned configuration for Qwen3-30B, for higher-quality low-bit weights.
Default 8-bit quantization configs with a configurable dynamic quantization group size, giving you a sensible starting point that's still tunable.
Richer calibration: datasets can now be specified with parameters inline, for example wikitext2:seq_len=128, making it easier to control how calibration data is collected.
A batch of correctness fixes, including the quantized-model save path and calibration data collection.
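To make the inline calibration syntax concrete, here is how a spec such as wikitext2:seq_len=128 breaks down into a dataset name and its parameters. This is an illustrative sketch only, not the parser the library uses: `parse_dataset_spec` is a hypothetical helper, and the comma separator for multiple parameters is an assumption that the release notes would need to confirm.

```python
def parse_dataset_spec(spec):
    """Split 'name:key=value,...' into (name, params). Illustrative only."""
    name, _, param_str = spec.partition(":")
    params = {}
    if param_str:
        # Assumption: multiple parameters are comma-separated.
        for item in param_str.split(","):
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError(f"Expected key=value, got {item!r}")
            value = value.strip()
            # Store plain integers as int; keep everything else as a string.
            params[key.strip()] = int(value) if value.isdigit() else value
    return name, params


print(parse_dataset_spec("wikitext2:seq_len=128"))
# ('wikitext2', {'seq_len': 128})
```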

Compressing a model to 4-bit weights is a single flag at export time:

optimum-cli export openvino \
  --model Qwen/Qwen2.5-7B-Instruct \
  --weight-format int4 \
  ov_qwen2.5_7b_instruct_int4
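For a rough sense of what that flag buys you, the back-of-the-envelope calculation below estimates the size of the weights at different precisions. It is a simplification under stated assumptions: it uses a round 7 billion parameters, ignores the quantization scales and zero-points stored alongside low-bit weights, and ignores any layers kept at higher precision, so real files will be somewhat larger. `weight_size_gb` is a helper written for this example.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a round 7 billion parameters, for illustration.
params = 7_000_000_000
for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{weight_size_gb(params, bits):.1f} GB")
# fp16: ~14.0 GB
# int8: ~7.0 GB
# int4: ~3.5 GB
```

At these sizes, 4-bit weights fit in the memory of a typical laptop where 16-bit weights often do not.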

Inference and architecture upgrades

Beyond models and compression, 2.0 brings the runtime in line with the latest stack and modern architectures:

Transformers v5 compatibility: the release supports transformers >= 4.45, < 5.1, so you can move to the 5.0 series or stay on a recent 4.x release.
Eagle3 speculative decoding draft-model support, for faster generation.
Proper support for hybrid and recurrent architectures: past_key_values handling for stateful inference on hybrid-attention models, and beam_idx wired through linear-attention layers (CausalConv1D, SSM, GDN) so beam search behaves correctly with recurrent models.
Long-context fixes for Phi-3.5 and Phi-4, plus MoE patching improvements and better mixed-type input handling.

These are the kinds of changes that make newer model families — MoE, SSM/attention hybrids, long-context — actually work well in production rather than just technically load.
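The beam_idx point is easiest to see with a toy example. During beam search, each surviving beam at a step continues from some parent beam of the previous step, so any per-beam state has to be reordered to follow its parent. A key/value cache is reordered this way, and a recurrent state (convolution or SSM) needs the same treatment. The sketch below uses plain Python lists and a hypothetical `reorder_state` helper; it illustrates the idea and is not how OpenVINO implements it.

```python
def reorder_state(state, beam_idx):
    """Return per-beam state where row i is copied from parent beam beam_idx[i]."""
    return [list(state[parent]) for parent in beam_idx]


# Toy recurrent state: one row per beam.
state = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# After scoring, beams 0 and 1 both continue from old beam 1; beam 2 from old beam 0.
beam_idx = [1, 1, 0]

print(reorder_state(state, beam_idx))
# [[0.3, 0.4], [0.3, 0.4], [0.1, 0.2]]
```

If the recurrent state were left in its old order, each beam would keep generating from another hypothesis's history, which is the kind of silent error the beam_idx wiring avoids.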

Who this is for

Optimum Intel 2.0 is most useful if you are:

Deploying open models on Intel hardware — Xeon and Core CPUs, Arc GPUs, or the NPU on Core Ultra AI PCs.
Building on-device or edge AI, where a small install footprint and aggressive quantization matter.
Quantizing LLMs and VLMs for efficient, lower-cost inference and want a maintained, OpenVINO-native path for INT8/INT4/AWQ.
Working across modalities — text, vision-language, speech, and video — and want one consistent API for all of them.

If you've been waiting for a clean, focused, OpenVINO-native way to take the newest open models local on Intel silicon, this is the release to upgrade to.

A note on migration

The removal of INC and IPEX is the one thing to plan around. If your workflow depends on those integrations, pin to the v1.27 line, which remains available. For everyone using the OpenVINO path — the large majority — upgrading is straightforward, and in most cases your install command actually gets simpler.

Get started

pip install --upgrade optimum-intel

Release notes: https://github.com/huggingface/optimum-intel/releases/tag/v2.0.0
Repository: https://github.com/huggingface/optimum-intel
Pre-converted OpenVINO models on the Hub: https://huggingface.co/OpenVINO
Documentation and examples are linked from the repo README.

Try it with your favorite open model, export to OpenVINO, quantize to INT4, and see how it runs on your Intel hardware. We'd love to hear what you build.
