	---
	library_name: transformers
	license: apache-2.0
	base_model: Qwen/Qwen3-VL-2B-Instruct
	tags:
	- video-understanding
	- streaming
	- proactive
	- activation-model
	- masked-diffusion
	- multimodal
	- plug-and-play
	language:
	- en
	pipeline_tag: video-classification
	model-index:
	- name: STRIDE-2B
	  results:
	  - task:
	      type: video-classification
	      name: Proactive Streaming Activation
	    dataset:
	      type: custom
	      name: OVO-Bench
	    metrics:
	    - type: accuracy
	      value: 59.07
	      name: Overall (w/ Qwen3-VL-8B)
	  - task:
	      type: video-classification
	      name: Proactive Streaming Activation
	    dataset:
	      type: custom
	      name: StreamingBench
	    metrics:
	    - type: accuracy
	      value: 59.29
	      name: Overall (w/ Qwen3-VL-8B)
	  - task:
	      type: video-classification
	      name: Temporal Grounding
	    dataset:
	      type: custom
	      name: ET-Bench
	    metrics:
	    - type: f1
	      value: 62.8
	      name: TVG F1
	    - type: f1
	      value: 10.7
	      name: EPM F1
	    - type: f1
	      value: 24.6
	      name: TAL F1
	    - type: f1
	      value: 36.5
	      name: DVC F1
	    - type: f1
	      value: 28.5
	      name: SLC F1
	---

	# STRIDE-2B

	STRIDE (Structured Temporal Refinement with Iterative DEnoising) is a lightweight proactive activation model for streaming video understanding.
	It decides when a downstream Video-LLM should respond during a live video stream — without waiting for explicit user queries.

	<p align="center">
	<a href="https://arxiv.org/abs/2603.27593"><img src="https://img.shields.io/badge/arXiv-2603.27593-b31b1b" alt="arXiv"></a>
	<a href="https://interlive-team.github.io/STRIDE"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
	<a href="https://github.com/interlive-team/STRIDE"><img src="https://img.shields.io/badge/GitHub-Code-black" alt="GitHub"></a>
	<a href="https://huggingface.co/interlive"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Model_Collection-yellow" alt="HF"></a>
	</p>

	> Paper: STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding
	>
	> Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, Yong Man Ro
	>
	> UIUC, KAIST, Google DeepMind

	## What is STRIDE?

	Existing streaming Video-LLMs are reactive — they only respond when a user explicitly asks a question. STRIDE makes them proactive by adding a lightweight front-end that continuously monitors incoming frames and predicts coherent activation spans indicating when to trigger a response.

	The key insight is that activation in streaming video is not a point-wise binary decision ("should I respond now?"), but a span-structured sequence modeling problem — the model must capture consistent onset (0 → 1), persistence (1 → 1), and offset (1 → 0) transitions. STRIDE achieves this through masked diffusion over a temporal activation window, jointly predicting and iteratively refining activation signals across the window.
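
To make the contrast with point-wise decisions concrete, here is a minimal sketch of confidence-ordered iterative unmasking over one activation window. It is a generic masked-denoising loop, not the schedule STRIDE actually uses; `iterative_unmask`, `toy_predictor`, and the example scores are hypothetical names and values introduced only for illustration, and the toy predictor stands in for the real model.

```python
from typing import Callable, List, Optional, Sequence

MASK = None  # marks a window position whose label is still undecided

Predictor = Callable[[Sequence[Optional[int]]], List[float]]


def iterative_unmask(window: Sequence[Optional[int]],
                     predict_probs: Predictor,
                     num_steps: int = 3) -> List[Optional[int]]:
    """Fill masked positions over several steps, most confident first."""
    state = list(window)
    for step in range(num_steps):
        masked = [i for i, v in enumerate(state) if v is MASK]
        if not masked:
            break
        probs = predict_probs(state)  # P(label == 1) for every position
        # Commit an equal share of the remaining masked positions per step;
        # the final step commits everything that is left.
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        ranked = sorted(masked, key=lambda i: abs(probs[i] - 0.5), reverse=True)
        for i in ranked[:k]:
            state[i] = 1 if probs[i] >= 0.5 else 0
    return state


def toy_predictor(frame_scores: Sequence[float]) -> Predictor:
    """Hypothetical stand-in for the model: blends each frame's own score
    with the labels of neighbours that have already been decided."""
    def predict(state: Sequence[Optional[int]]) -> List[float]:
        probs = []
        for i, score in enumerate(frame_scores):
            neighbours = [state[j] for j in (i - 1, i + 1)
                          if 0 <= j < len(state) and state[j] is not MASK]
            if neighbours:
                score = 0.5 * score + 0.5 * sum(neighbours) / len(neighbours)
            probs.append(score)
        return probs
    return predict


frame_scores = [0.05, 0.2, 0.9, 0.45, 0.8, 0.15]
pointwise = [1 if s >= 0.5 else 0 for s in frame_scores]
refined = iterative_unmask([MASK] * len(frame_scores),
                           toy_predictor(frame_scores), num_steps=3)
print(pointwise)  # [0, 0, 1, 0, 1, 0]  (fragmented: two one-frame spans)
print(refined)    # [0, 0, 1, 1, 1, 0]  (one coherent span)
```

In this toy case, thresholding each frame independently splits the event in two because one frame scores just below 0.5, whereas deciding confident positions first and conditioning the rest on them yields a single onset and a single offset.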

	### Two-Stage Architecture

	```
	Video Stream
	     │
	     ▼
	[STRIDE Activation Model]   ← this model (2B)
	     │
	     │  trigger (only if active)
	     ▼
	[Downstream Video-LLM]      ← frozen, any off-the-shelf
	     │
	     ▼
	  Response
	```

	- Stage 1 — Activation (STRIDE): Monitors the stream at 1 FPS, maintains a sliding activation window, and iteratively denoises binary activation labels via masked diffusion.
	- Stage 2 — Response (Downstream LLM): When triggered, the frozen downstream Video-LLM receives the accumulated frame cache and generates a response. STRIDE is fully plug-and-play — compatible with any off-the-shelf Video-LLM.
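
The two stages above can be sketched as a simple streaming loop. This is an illustrative outline under stated assumptions, not the repository's pipeline: `stream_with_trigger`, `activate`, and `respond` are hypothetical names, frames are assumed to arrive already sampled at 1 FPS, the trigger is assumed to fire on each onset (0 → 1) of the newest label, and the frame cache is assumed to keep growing rather than being reset after a response.

```python
from collections import deque
from typing import Any, Callable, Deque, Iterable, Iterator, List, Tuple


def stream_with_trigger(frames: Iterable[Any],
                        activate: Callable[[List[Any]], List[int]],
                        respond: Callable[[List[Any]], str],
                        window_size: int = 8) -> Iterator[Tuple[int, str]]:
    """Yield (timestep, response) whenever the activation model turns on."""
    cache: List[Any] = []                       # accumulated frames for stage 2
    window: Deque[Any] = deque(maxlen=window_size)  # sliding activation window
    was_active = False
    for t, frame in enumerate(frames):
        cache.append(frame)
        window.append(frame)
        labels = activate(list(window))         # one 0/1 label per window frame
        is_active = labels[-1] == 1
        if is_active and not was_active:        # onset: 0 -> 1
            yield t, respond(list(cache))       # stage 2 runs only when triggered
        was_active = is_active


# Toy stand-ins: frames are plain numbers, "active" means value > 5.
def toy_activate(window: List[int]) -> List[int]:
    return [1 if f > 5 else 0 for f in window]


def toy_respond(cache: List[int]) -> str:
    return f"saw {len(cache)} frames"


events = list(stream_with_trigger([1, 2, 7, 8, 3, 9], toy_activate, toy_respond))
print(events)  # [(2, 'saw 3 frames'), (5, 'saw 6 frames')]
```

Because the downstream model is passed in as a plain callable, any off-the-shelf Video-LLM wrapper can be swapped in without touching the activation side, which is the plug-and-play property described above.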

	## Results

	### OVO-Bench (Online Video Understanding)

	| Method | Real-Time Perception | Backward Tracing | Forward Active Responding | Overall |
	|---|:---:|:---:|:---:|:---:|
	| Flash-VStream-7B | 28.37 | 27.38 | 45.09 | 33.61 |
	| Dispider | 54.55 | 36.06 | 34.72 | 41.78 |
	| TimeChat-Online-7B | 58.60 | 42.00 | 36.40 | 45.60 |
	| QueryStream-7B | 61.40 | 42.10 | 39.03 | 47.51 |
	| StreamAgent-7B | 61.30 | 41.70 | 45.40 | 49.40 |
	| STRIDE + Gemma3-4B | 60.93 | 34.87 | 55.73 | 50.51 |
	| STRIDE + InternVL3-8B | 67.72 | 45.23 | 58.00 | 56.98 |
	| STRIDE + Qwen3-VL-8B | 69.68 | 47.83 | 59.70 | 59.07 |

	### StreamingBench (Streaming Comprehension)

	| Method | Real-Time Visual | Omni-Source | Contextual | Overall |
	|---|:---:|:---:|:---:|:---:|
	| Flash-VStream-7B | 23.23 | 26.00 | 24.12 | 24.04 |
	| VideoLLM-Online-8B | 35.99 | 28.45 | 26.55 | 32.48 |
	| Dispider | 67.63 | 35.66 | 33.61 | 53.12 |
	| StreamAgent-7B | 74.31 | 36.26 | 34.62 | 57.02 |
	| STRIDE + Gemma3-4B | 60.00 | 36.80 | 38.80 | 50.14 |
	| STRIDE + InternVL3-8B | 72.45 | 39.20 | 38.80 | 57.58 |
	| STRIDE + Qwen3-VL-8B | 74.24 | 41.30 | 39.90 | 59.29 |

	### ET-Bench (Temporal Grounding, Activation-Only)

	| Model | Params | TVG | EPM | TAL | DVC | SLC | Avg |
	|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
	| *Temporal-Localization Specialized* | | | | | | | |
	| VTimeLLM | 7B | 7.6 | 1.9 | 18.2 | 12.4 | 8.7 | 9.8 |
	| TimeChat | 7B | 26.2 | 3.9 | 10.1 | 16.6 | 5.6 | 12.5 |
	| VTG-LLM | 7B | 15.9 | 3.7 | 14.4 | 40.2 | 20.8 | 19.0 |
	| LITA | 13B | 22.2 | 4.6 | 18.0 | <u>39.7</u> | 21.0 | 21.1 |
	| ETChat | 5B | <u>38.6</u> | 10.2 | 30.8 | 38.4 | <u>24.4</u> | <u>28.5</u> |
	| *Streaming Baselines* | | | | | | | |
	| VideoLLM-Online | 8B | 13.2 | 3.8 | 9.1 | 24.0 | 9.9 | 12.0 |
	| Dispider | 9B | 36.1 | 15.5 | <u>27.3</u> | 33.8 | 18.8 | 26.3 |
	| StreamBridge | 8B | 34.3 | – | 24.3 | 38.3 | 22.6 | – |
	| *Ours* | | | | | | | |
	| STRIDE | 2B | 62.8 | <u>10.7</u> | 24.6 | 36.5 | 28.5 | 32.6 |

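	STRIDE achieves the best average F1 (32.6, versus 28.5 for ETChat, the next best) with only 2B parameters, ahead of the 5B to 13B temporal-localization specialized models and the 8B to 9B streaming baselines that report an average. It leads on TVG and SLC, while larger models remain ahead on EPM, TAL, and DVC.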
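
As a quick check on the Avg column, the snippet below recomputes it from the per-task scores in the table. It assumes Avg is the unweighted mean of the five task F1 scores rounded to one decimal; the model card does not state this, but the assumption reproduces the reported values for the rows tried here.

```python
from statistics import mean

# Per-task F1 scores copied from the ET-Bench table above.
et_bench_f1 = {
    "STRIDE":   {"TVG": 62.8, "EPM": 10.7, "TAL": 24.6, "DVC": 36.5, "SLC": 28.5},
    "ETChat":   {"TVG": 38.6, "EPM": 10.2, "TAL": 30.8, "DVC": 38.4, "SLC": 24.4},
    "Dispider": {"TVG": 36.1, "EPM": 15.5, "TAL": 27.3, "DVC": 33.8, "SLC": 18.8},
}

# Assumption: Avg = unweighted mean over the five tasks, one decimal place.
averages = {model: round(mean(scores.values()), 1)
            for model, scores in et_bench_f1.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}")
```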

	## Usage

	For the full streaming inference pipeline and evaluation scripts, please refer to the [STRIDE GitHub repository](https://github.com/interlive-team/STRIDE).
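
Until you set up the repository's pipeline, one way to fetch the checkpoint files locally is sketched below. It relies only on `snapshot_download` from the `huggingface_hub` package (assumed to be installed) and on the repository ID from the Model Variants table; `download_stride` is a hypothetical helper name. Loading and running the `Qwen3VLForProactiveMDM` architecture itself is not shown, because that code lives in the GitHub repository.

```python
from typing import Optional

REPO_ID = "interlive/STRIDE-2B"  # repository ID taken from the Model Variants table


def download_stride(local_dir: Optional[str] = None) -> str:
    """Download the model repository and return the local directory path.

    Hypothetical helper: it only fetches files and does not load the model.
    """
    # Imported lazily so this module can be imported without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example (requires network access):
#     path = download_stride()
#     print(path)
```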

	## Training

	- Architecture: `Qwen3VLForProactiveMDM` (Qwen3-VL backbone with a temporal activation head)
	- Base model: [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
	- Training data: Temporal activation annotations curated from eight publicly available video understanding datasets (ActivityNet-Captions, LITA, YouCook2, ET-Instruct, Charades, CharadesEgo, DiDeMo, Grounded-VideoLLM)

	## Model Variants

	| Model | Params | Description |
	|---|---|---|
	| [STRIDE-2B](https://huggingface.co/interlive/STRIDE-2B) (this) | 2B | Default activation model |
	| STRIDE-4B | 4B | Scaled variant with improved accuracy |

	## Citation

	```bibtex
	@article{kim2026stride,
	  title={STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding},
	  author={Kim, Junho and Lee, Hosu and Rehg, James M. and Kim, Minsu and Ro, Yong Man},
	  journal={arXiv preprint arXiv:2603.27593},
	  year={2026}
	}
	```

	## License

	This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).