---
title: fastvlm-0.5b-unity
emoji: 🎬
colorFrom: blue
colorTo: green
sdk: static
pinned: false
license: mit
short_description: Real-time scene captioning with FastVLM ONNX on Unity Sentis
---

# FastVLM 0.5B for Unity Sentis

This repository is a Unity 6 + Sentis (`com.unity.ai.inference`) demo for FastVLM-based scene captioning.

## Demo (YouTube)

[![FastVLM Unity](https://img.youtube.com/vi/zyDNLEEXR0Q/0.jpg)](https://www.youtube.com/watch?v=zyDNLEEXR0Q)

## Environment

- **Unity Version**: `6000.3.6f1`
- **Sentis Version**: `com.unity.ai.inference 2.5.0` (customized)
  - **Custom layers added to the ONNX converter**:
    `RotaryEmbedding`, `GroupQueryAttention`, `SimplifiedLayerNormalization`, `SkipSimplifiedLayerNormalization`
  - **Implementation file**:
    `fastvlm-0.5b-unity/Packages/com.unity.ai.inference/Editor/ONNX/ONNXModelConverter.cs`

## Project Structure

- `Assets/FastVLM/FastVLMScene.unity`: Main runtime scene
- `Assets/FastVLM/VLMController.cs`: VideoPlayer-UI bridge and continuous inference loop
- `Assets/FastVLM/ModelVLM.cs`: Model initialization, vision/text embedding composition, and generation
- `Assets/FastVLM/Qwen2Tokenizer.cs`: Qwen2 BPE tokenizer
- `Assets/StreamingAssets/fastvlm/`: `vocab.json`, `merges.txt`, `tokenizer_config.json`
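
As a quick sanity check on the tokenizer assets outside Unity, the following Python sketch parses `vocab.json` and `merges.txt`. It is not part of this project: `load_bpe_assets` is a hypothetical helper, and it assumes the usual GPT-2 style BPE layout (a JSON token-to-ID map, plus one space-separated merge pair per line with an optional `#version` header). Check that assumption against the actual files before relying on it.

```python
import json
from pathlib import Path


def load_bpe_assets(asset_dir):
    """Parse vocab.json and merges.txt from asset_dir (hypothetical helper).

    Assumes vocab.json maps token -> id and merges.txt lists one
    space-separated pair per line, highest-priority merge first.
    """
    asset_dir = Path(asset_dir)
    vocab = json.loads((asset_dir / "vocab.json").read_text(encoding="utf-8"))

    merge_ranks = {}
    merges_text = (asset_dir / "merges.txt").read_text(encoding="utf-8")
    for line in merges_text.splitlines():
        # Skip blank lines and an optional "#version: ..." header.
        if not line.strip() or line.startswith("#version"):
            continue
        left, right = line.split(" ")
        # Lower rank means the pair is merged earlier during encoding.
        merge_ranks[(left, right)] = len(merge_ranks)
    return vocab, merge_ranks
```

Run from the project root, `load_bpe_assets("Assets/StreamingAssets/fastvlm")` returns the vocabulary and the merge ranks. A `ValueError` on a merge line would suggest the files do not follow the assumed layout.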

## Required Model Files

Source models: https://huggingface.co/onnx-community/FastVLM-0.5B-ONNX/tree/main/onnx

Download the three files below from that page, copy them into `Assets/FastVLM/Models/`, and assign them to the `ModelVLM` component on `VLMManager`.

- `vision_encoder.onnx`
- `embed_tokens.onnx`
- `decoder_model_merged.onnx`
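
A small preflight check can confirm the files are in place before opening Unity. The Python sketch below is illustrative only: `missing_models`, `download_url`, and `report` are hypothetical helpers, and the direct-download URL assumes the Hugging Face `resolve/main` path layout for the repository linked above.

```python
from pathlib import Path

# File names are taken from the list above; the URL pattern is an assumption.
REQUIRED_MODELS = (
    "vision_encoder.onnx",
    "embed_tokens.onnx",
    "decoder_model_merged.onnx",
)
BASE_URL = "https://huggingface.co/onnx-community/FastVLM-0.5B-ONNX/resolve/main/onnx"


def missing_models(models_dir):
    """Return the required ONNX file names that are absent from models_dir."""
    models_dir = Path(models_dir)
    return [name for name in REQUIRED_MODELS if not (models_dir / name).is_file()]


def download_url(name):
    """Build a direct-download URL for one file (assumed resolve/main layout)."""
    return f"{BASE_URL}/{name}"


def report(models_dir="Assets/FastVLM/Models"):
    """Build one status line per required file, with a URL for missing ones."""
    missing = set(missing_models(models_dir))
    lines = []
    for name in REQUIRED_MODELS:
        if name in missing:
            lines.append(f"MISSING {name} <- {download_url(name)}")
        else:
            lines.append(f"ok      {name}")
    return lines
```

Calling `print("\n".join(report()))` from the project root lists each file as present or missing. The check only looks at file names, so it does not validate the contents of the ONNX files.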

## Quick Start

1. Open the project in Unity `6000.3.6f1`.
2. Open `Assets/FastVLM/FastVLMScene.unity`.
3. Check `VLMManager > ModelVLM` and verify all `ModelAsset` fields are assigned.
4. Press Play.
5. To change the prompt, edit it in `InputField`. The next iteration of the inference loop uses the updated prompt.

## License

MIT