---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-VL-2B-Instruct
tags:
- video-understanding
- streaming
- proactive
- activation-model
- masked-diffusion
- multimodal
- plug-and-play
language:
- en
pipeline_tag: video-classification
model-index:
- name: STRIDE-2B
  results:
  - task:
      type: video-classification
      name: Proactive Streaming Activation
    dataset:
      type: custom
      name: OVO-Bench
    metrics:
    - type: accuracy
      value: 59.07
      name: Overall (w/ Qwen3-VL-8B)
  - task:
      type: video-classification
      name: Proactive Streaming Activation
    dataset:
      type: custom
      name: StreamingBench
    metrics:
    - type: accuracy
      value: 59.29
      name: Overall (w/ Qwen3-VL-8B)
  - task:
      type: video-classification
      name: Temporal Grounding
    dataset:
      type: custom
      name: ET-Bench
    metrics:
    - type: f1
      value: 62.8
      name: TVG F1
    - type: f1
      value: 10.7
      name: EPM F1
    - type: f1
      value: 24.6
      name: TAL F1
    - type: f1
      value: 36.5
      name: DVC F1
    - type: f1
      value: 28.5
      name: SLC F1
---

# STRIDE-2B

**STRIDE** (**S**tructured **T**emporal **R**efinement with **I**terative **DE**noising) is a lightweight proactive activation model for streaming video understanding.
It decides **when** a downstream Video-LLM should respond during a live video stream, without waiting for explicit user queries.

<p align="center">
<a href="https://arxiv.org/abs/2603.27593"><img src="https://img.shields.io/badge/arXiv-2603.27593-b31b1b" alt="arXiv"></a>
<a href="https://interlive-team.github.io/STRIDE"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
<a href="https://github.com/interlive-team/STRIDE"><img src="https://img.shields.io/badge/GitHub-Code-black" alt="GitHub"></a>
<a href="https://huggingface.co/interlive"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Model_Collection-yellow" alt="HF"></a>
</p>

> **Paper**: *STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding*
>
> Junho Kim\*, Hosu Lee\*, James M. Rehg, Minsu Kim, Yong Man Ro
>
> UIUC, KAIST, Google DeepMind

## What is STRIDE?

Existing streaming Video-LLMs are **reactive**: they only respond when a user explicitly asks a question. STRIDE makes them **proactive** by adding a lightweight front-end that continuously monitors incoming frames and predicts coherent activation spans indicating *when* to trigger a response.

The key insight is that activation in streaming video is not a point-wise binary decision ("should I respond *now*?"), but a **span-structured** sequence modeling problem: the model must capture consistent onset (0 → 1), persistence (1 → 1), and offset (1 → 0) transitions. STRIDE achieves this through **masked diffusion** over a temporal activation window, jointly predicting and iteratively refining activation signals across the window.

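
To make the span view concrete, the following minimal Python sketch converts a per-frame binary activation sequence into spans by locating onset (0 → 1) and offset (1 → 0) transitions. The helper name `activation_spans` and the two example sequences are hypothetical and are not taken from the STRIDE codebase; they only illustrate the label structure described above.

```python
from typing import List, Tuple


def activation_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (onset, offset) index pairs, one per run of 1s.

    `onset` is the first active index. `offset` is the first inactive index
    after the run (exclusive end), or len(labels) if the run reaches the end.
    """
    spans = []
    start = None
    prev = 0
    for i, cur in enumerate(labels):
        if prev == 0 and cur == 1:      # onset: 0 -> 1
            start = i
        elif prev == 1 and cur == 0:    # offset: 1 -> 0
            spans.append((start, i))
            start = None
        prev = cur                      # 1 -> 1 (persistence) changes nothing
    if start is not None:               # span still active at the window edge
        spans.append((start, len(labels)))
    return spans


coherent = [0, 0, 1, 1, 1, 1, 0, 0]     # one clean activation span
flickering = [0, 1, 0, 1, 1, 0, 1, 0]   # independent per-frame decisions
print(activation_spans(coherent))       # [(2, 6)]
print(activation_spans(flickering))     # [(1, 2), (3, 5), (6, 7)]
```

The coherent sequence yields a single span, whereas the flickering one fragments into three short spans, each of which could trigger the downstream model separately. This is the failure mode that span-structured prediction is meant to avoid.
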
### Two-Stage Architecture

```
Video Stream
     │
     ▼
[STRIDE Activation Model]   ← this model (2B)
     │
     │  trigger (only if active)
     ▼
[Downstream Video-LLM]      ← frozen, any off-the-shelf
     │
     ▼
  Response
```

- **Stage 1 - Activation (STRIDE):** Monitors the stream at 1 FPS, maintains a sliding activation window, and iteratively denoises binary activation labels via masked diffusion.
- **Stage 2 - Response (Downstream LLM):** When triggered, the frozen downstream Video-LLM receives the accumulated frame cache and generates a response. STRIDE is fully **plug-and-play**: it is compatible with any off-the-shelf Video-LLM.

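
The two stages can be sketched as a small control loop. The code below is an illustrative toy, not the STRIDE implementation: `denoise_window`, `run_stream`, `toy_predict`, and `toy_respond` are hypothetical names, and the decoding rule (start fully masked, then commit the most confident positions over a fixed number of steps) is one common way to run a masked-diffusion decoder, assumed here for illustration. Triggering once at each onset, keeping the frame cache after a response, the window size of 8, and the 4 denoising steps are likewise assumptions.

```python
from collections import deque
from typing import Callable, List, Optional, Sequence

# Hypothetical activation-model interface: (window frames, current labels) -> P(active) per position.
ProbFn = Callable[[Sequence[object], Sequence[Optional[int]]], List[float]]


def denoise_window(window: Sequence[object], predict_probs: ProbFn, num_steps: int = 4) -> List[int]:
    """Toy masked decoding: start fully masked and commit the most confident
    positions at each step until every label is 0 or 1."""
    labels: List[Optional[int]] = [None] * len(window)   # None marks a masked label
    for step in range(num_steps):
        masked = [i for i, v in enumerate(labels) if v is None]
        if not masked:
            break
        probs = predict_probs(window, labels)
        steps_left = num_steps - step
        k = -(-len(masked) // steps_left)                # ceiling division; last step commits all
        by_confidence = sorted(masked, key=lambda i: abs(probs[i] - 0.5), reverse=True)
        for i in by_confidence[:k]:
            labels[i] = int(probs[i] >= 0.5)
    return labels


def run_stream(frames, predict_probs: ProbFn, respond, window_size: int = 8) -> list:
    """Stage 1 decides when to speak; stage 2 (`respond`) is called only on a trigger."""
    window = deque(maxlen=window_size)   # sliding activation window
    cache = []                           # frames accumulated for the downstream model
    responses = []
    was_active = False
    for frame in frames:                 # frames are assumed to be sampled at 1 FPS already
        window.append(frame)
        cache.append(frame)
        labels = denoise_window(list(window), predict_probs)
        is_active = labels[-1] == 1
        if is_active and not was_active: # assumed policy: trigger once, at the onset
            responses.append(respond(list(cache)))
        was_active = is_active
    return responses


def toy_predict(window, labels):
    # Stand-in for the activation model: a frame value of 1 means "something is happening".
    # A real model would also condition on the labels that are already committed.
    return [0.9 if f == 1 else 0.1 for f in window]


def toy_respond(cache):
    # Stand-in for the frozen downstream Video-LLM.
    return f"saw {len(cache)} frames"


stream = [0, 0, 1, 1, 1, 0, 0, 1]
responses = run_stream(stream, toy_predict, toy_respond)
print(responses)                         # ['saw 3 frames', 'saw 8 frames']
```

Because the downstream model is reached only through the `respond` callable, swapping in a different Video-LLM does not change the activation loop, which is the sense in which the front-end is plug-and-play.
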
## Results

### OVO-Bench (Online Video Understanding)

| Method | Real-Time Perception | Backward Tracing | Forward Active Responding | Overall |
|---|:---:|:---:|:---:|:---:|
| Flash-VStream-7B | 28.37 | 27.38 | 45.09 | 33.61 |
| Dispider | 54.55 | 36.06 | 34.72 | 41.78 |
| TimeChat-Online-7B | 58.60 | 42.00 | 36.40 | 45.60 |
| QueryStream-7B | 61.40 | 42.10 | 39.03 | 47.51 |
| StreamAgent-7B | 61.30 | 41.70 | 45.40 | 49.40 |
| **STRIDE** + Gemma3-4B | 60.93 | 34.87 | 55.73 | 50.51 |
| **STRIDE** + InternVL3-8B | 67.72 | 45.23 | 58.00 | 56.98 |
| **STRIDE** + Qwen3-VL-8B | 69.68 | 47.83 | 59.70 | **59.07** |

### StreamingBench (Streaming Comprehension)

| Method | Real-Time Visual | Omni-Source | Contextual | Overall |
|---|:---:|:---:|:---:|:---:|
| Flash-VStream-7B | 23.23 | 26.00 | 24.12 | 24.04 |
| VideoLLM-Online-8B | 35.99 | 28.45 | 26.55 | 32.48 |
| Dispider | 67.63 | 35.66 | 33.61 | 53.12 |
| StreamAgent-7B | 74.31 | 36.26 | 34.62 | 57.02 |
| **STRIDE** + Gemma3-4B | 60.00 | 36.80 | 38.80 | 50.14 |
| **STRIDE** + InternVL3-8B | 72.45 | 39.20 | 38.80 | 57.58 |
| **STRIDE** + Qwen3-VL-8B | 74.24 | 41.30 | 39.90 | **59.29** |

### ET-Bench (Temporal Grounding, Activation-Only)

| Model | Params | TVG | EPM | TAL | DVC | SLC | Avg |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| *Temporal-Localization Specialized* | | | | | | | |
| VTimeLLM | 7B | 7.6 | 1.9 | 18.2 | 12.4 | 8.7 | 9.8 |
| TimeChat | 7B | 26.2 | 3.9 | 10.1 | 16.6 | 5.6 | 12.5 |
| VTG-LLM | 7B | 15.9 | 3.7 | 14.4 | **40.2** | 20.8 | 19.0 |
| LITA | 13B | 22.2 | 4.6 | 18.0 | <u>39.7</u> | 21.0 | 21.1 |
| ETChat | 5B | <u>38.6</u> | 10.2 | **30.8** | 38.4 | <u>24.4</u> | <u>28.5</u> |
| *Streaming Baselines* | | | | | | | |
| VideoLLM-Online | 8B | 13.2 | 3.8 | 9.1 | 24.0 | 9.9 | 12.0 |
| Dispider | 9B | 36.1 | **15.5** | <u>27.3</u> | 33.8 | 18.8 | 26.3 |
| StreamBridge | 8B | 34.3 | – | 24.3 | 38.3 | 22.6 | – |
| *Ours* | | | | | | | |
| **STRIDE** | **2B** | **62.8** | <u>10.7</u> | 24.6 | 36.5 | **28.5** | **32.6** |

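STRIDE achieves the best average F1 (32.6) with only 2B parameters, ahead of every temporal-localization specialized model (5B to 13B) and every streaming baseline (8B to 9B) with a reported average. Per task, it leads on TVG and SLC and ranks second on EPM, while ETChat, VTG-LLM, and Dispider remain stronger on TAL, DVC, and EPM, respectively.
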
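
The Avg column appears to be the unweighted mean of the five per-task F1 scores. This is inferred from the numbers in the table rather than stated in the card, so treat it as an assumption. A short Python check on three rows copied from the table:

```python
# Per-task F1 scores (TVG, EPM, TAL, DVC, SLC) copied from the ET-Bench table above.
et_bench_f1 = {
    "STRIDE (2B)": [62.8, 10.7, 24.6, 36.5, 28.5],
    "ETChat (5B)": [38.6, 10.2, 30.8, 38.4, 24.4],
    "Dispider (9B)": [36.1, 15.5, 27.3, 33.8, 18.8],
}

# Assumption: Avg is the plain arithmetic mean, rounded to one decimal place.
averages = {name: round(sum(scores) / len(scores), 1) for name, scores in et_bench_f1.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```

Under that assumption the computed values match the reported Avg entries for these rows (32.6, 28.5, and 26.3).
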
## Usage

For the full streaming inference pipeline and evaluation scripts, please refer to the [STRIDE GitHub repository](https://github.com/interlive-team/STRIDE).

## Training

- **Architecture:** `Qwen3VLForProactiveMDM` (Qwen3-VL backbone with a temporal activation head)
- **Base model:** [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
- **Training data:** Temporal activation annotations curated from eight publicly available video understanding datasets (ActivityNet-Captions, LITA, YouCook2, ET-Instruct, Charades, CharadesEgo, DiDeMo, Grounded-VideoLLM)

## Model Variants

| Model | Params | Description |
|---|---|---|
| [**STRIDE-2B**](https://huggingface.co/interlive/STRIDE-2B) (this) | 2B | Default activation model |
| STRIDE-4B | 4B | Scaled variant with improved accuracy |

## Citation

```bibtex
@article{kim2026stride,
  title={STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding},
  author={Kim, Junho and Lee, Hosu and Rehg, James M. and Kim, Minsu and Ro, Yong Man},
  journal={arXiv preprint arXiv:2603.27593},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).