---
language: it
license: apache-2.0
library_name: onnxruntime
pipeline_tag: voice-activity-detection
tags:
- turn-detection
- end-of-utterance
- distilbert
- onnx
- quantized
- conversational-ai
- voice-assistant
- real-time
base_model: distilbert-base-multilingual-cased
datasets:
- videosdk-live/Namo-Turn-Detector-v1-Train
model-index:
- name: Namo Turn Detector v1 - Italian
  results:
  - task:
      type: text-classification
      name: Turn Detection
    dataset:
      name: Namo Turn Detector v1 Test - Italian
      type: videosdk-live/Namo-Turn-Detector-v1-Test
      split: train
    metrics:
    - type: accuracy
      value: 0.868286
      name: Accuracy
    - type: f1
      value: 0.880925
      name: F1 Score
    - type: precision
      value: 0.80042
      name: Precision
    - type: recall
      value: 0.979434
      name: Recall
---

	# 🎯 Namo Turn Detector v1 - Italian

	<div align="center">

	[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
	[![ONNX](https://img.shields.io/badge/ONNX-Optimized-brightgreen)](https://onnx.ai/)
	[![Model Size](https://img.shields.io/badge/Model%20Size-~136M-orange)](https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Italian)
	[![Inference Speed](https://img.shields.io/badge/Inference-<13ms-red)]()

	🚀 Namo Turn Detection Model for Italian

	</div>

	---

	## 📋 Overview

	The Namo Turn Detector is a specialized AI model designed to solve one of the most challenging problems in conversational AI: knowing when a user has finished speaking.

This Italian-specific model analyzes the text of a transcribed utterance to distinguish between:
	- ✅ Complete utterances (user is done speaking)
	- 🔄 Incomplete utterances (user will continue speaking)

Built on the DistilBERT architecture (`distilbert-base-multilingual-cased`) and exported as a quantized ONNX model, it runs inference in under 13 ms, fast enough for real-time conversational use.
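
In a voice pipeline, the classifier's output typically decides whether the agent replies or keeps listening. The sketch below shows one hypothetical way to act on a `(label, confidence)` pair. The function name `should_respond` and the `0.7` threshold are illustrative assumptions, not part of this model or of the VideoSDK API, and it assumes label `1` means "End of Turn" as in the Quick Start code.

```python
def should_respond(label: int, confidence: float, threshold: float = 0.7) -> bool:
    """Return True when the agent should start replying.

    label: 1 for "End of Turn", 0 for "Not End of Turn" (assumed mapping).
    confidence: probability of the predicted label, between 0 and 1.
    threshold: hypothetical cut-off; tune it on your own traffic.
    """
    return label == 1 and confidence >= threshold


# A confident end-of-turn prediction triggers a reply.
print(should_respond(1, 0.93))  # True
# A low-confidence end-of-turn prediction keeps the agent listening.
print(should_respond(1, 0.55))  # False
# An incomplete utterance never triggers a reply, whatever the confidence.
print(should_respond(0, 0.99))  # False
```

Raising the threshold makes the agent wait longer before replying; lowering it makes the agent reply sooner but interrupt more often.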

	## 🔑 Key Features

	- Turn Detection Specialist: Detects end-of-turn vs. continuation in Italian speech transcripts.
	- Low Latency: Optimized with quantized ONNX for <13ms inference.
	- Robust Performance: 86.8% accuracy on diverse Italian utterances.
	- Easy Integration: Compatible with Python, ONNX Runtime, and VideoSDK Agents SDK.
	- Enterprise Ready: Supports real-time conversational AI and voice assistants.

	## 📊 Performance Metrics
	<div>

| Metric | Score |
|--------|-------|
| 🎯 Accuracy | 86.83% |
| 📈 F1-Score | 88.09% |
| 🎪 Precision | 80.04% |
| 🎭 Recall | 97.94% |
| ⚡ Latency | <13ms |
| 💾 Model Size | ~135MB |

	</div>
<img src="./confusion_matrices.png" alt="Confusion matrices" width="600" height="400"/>

	> 📊 Evaluated on 700+ Italian utterances from diverse conversational contexts
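
The reported F1 score can be cross-checked from precision and recall, since F1 is their harmonic mean. The helper below is a minimal sketch; the name `f1_score` is ours and does not come from the evaluation scripts linked in this card.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the model card metadata.
precision = 0.80042
recall = 0.979434

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.8809
```

If "End of Turn" is the positive class (an assumption; the card does not state it), high recall with lower precision would mean the model rarely misses a real end of turn but sometimes flags one too early.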

	## ⚡️ Speed Analysis

<img src="./performance_analysis.png" alt="Performance analysis" width="600" height="400"/>
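
Latency depends on hardware, input length, and runtime settings, so it is worth re-measuring the sub-13 ms figure on your own machine. The timing harness below is a generic sketch using only the standard library; `measure_latency_ms` and its `warmup` and `runs` parameters are hypothetical names, not part of this repository.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[], object], warmup: int = 5, runs: int = 50) -> dict:
    """Time `fn` and return median and worst-case latency in milliseconds."""
    for _ in range(warmup):
        fn()  # Warm-up calls are not timed.

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


# Stand-in workload so the sketch runs without the model.
stats = measure_latency_ms(lambda: sum(range(1000)))
print(stats)
```

To time the model itself, pass a callable that wraps the Quick Start detector, for example `lambda: detector.predict("Ciao, come stai")`.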

	## 🔧 Train & Test Scripts

	<div align="center">

	[![Train Script](https://img.shields.io/badge/Colab-Train%20Script-brightgreen?logo=google-colab)](https://colab.research.google.com/drive/1DqSUYfcya0r2iAEZB9fS4mfrennubduV) [![Test Script](https://img.shields.io/badge/Colab-Test%20Script-blue?logo=google-colab)](https://colab.research.google.com/drive/19ZOlNoHS2WLX2V4r5r492tsCUnYLXnQR)

	</div>

	## 🛠️ Installation

	To use this model, you will need to install the following libraries.

	```bash
	pip install onnxruntime transformers huggingface_hub
	```
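
To confirm the environment is ready before running the Quick Start, a short check like the following can help. It is a sketch using only the standard library; `missing_packages` is a hypothetical helper name, and `numpy` is listed because the Quick Start code imports it directly.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used by the Quick Start code.
required = ["numpy", "onnxruntime", "transformers", "huggingface_hub"]
missing = missing_packages(required)
print("Missing:", missing if missing else "none")
```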

	## 🚀 Quick Start

You can run inference directly from the Hugging Face repository.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download


class TurnDetector:
    def __init__(self, repo_id="videosdk-live/Namo-Turn-Detector-v1-Italian"):
        """
        Initializes the detector by downloading the model and tokenizer
        from the Hugging Face Hub.
        """
        print(f"Loading model from repo: {repo_id}")

        # Download the model and tokenizer from the Hub
        # Authentication is handled automatically if you are logged in
        model_path = hf_hub_download(repo_id=repo_id, filename="model_quant.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(repo_id)

        # Set up the ONNX Runtime inference session
        self.session = ort.InferenceSession(model_path)
        self.max_length = 512
        print("✅ Model and tokenizer loaded successfully.")

    def predict(self, text: str) -> tuple:
        """
        Predicts if a given text utterance is the end of a turn.
        Returns (predicted_label, confidence) where:
        - predicted_label: 0 for "Not End of Turn", 1 for "End of Turn"
        - confidence: confidence score between 0 and 1
        """
        # Tokenize the input text
        inputs = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            return_tensors="np"
        )

        # Prepare the feed dictionary for the ONNX model
        feed_dict = {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"]
        }

        # Run inference; the first output holds the logits, shape (batch, 2)
        outputs = self.session.run(None, feed_dict)
        logits = outputs[0]

        # Convert the logits of the single input to probabilities
        probabilities = self._softmax(logits[0])
        predicted_label = int(np.argmax(probabilities))
        confidence = float(np.max(probabilities))

        return predicted_label, confidence

    def _softmax(self, x, axis=-1):
        # Subtract the maximum before exponentiating for numerical stability
        exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


# --- Example Usage ---
if __name__ == "__main__":
    detector = TurnDetector()

    sentences = [
        "È stato spesso visto tirar fuori i favi dai grandi nidi di vespe e di api, così come dai nidi più piccoli dei calabroni.",  # Expected: End of Turn
        "L'opera che ne falsi del tutto lo spirito è?",  # Expected: Not End of Turn
    ]

    for sentence in sentences:
        predicted_label, confidence = detector.predict(sentence)
        result = "End of Turn" if predicted_label == 1 else "Not End of Turn"
        print(f"'{sentence}' -> {result} (confidence: {confidence:.3f})")
        print("-" * 50)
```
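
The post-processing step in `predict` turns two raw logits into a label and a confidence score. The standalone sketch below mirrors that step in pure Python so it can be run without the model; the logit values are made up for illustration and are not real model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for ["Not End of Turn", "End of Turn"].
logits = [-1.2, 2.3]
probs = softmax(logits)
label = max(range(len(probs)), key=probs.__getitem__)  # argmax
confidence = probs[label]
print(label, round(confidence, 3))  # 1 0.971
```

The confidence is simply the probability of the winning class, so with two classes it always lies between 0.5 and 1.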


	## 🤖 VideoSDK Agents Integration

	Integrate this turn detector directly with VideoSDK Agents for production-ready conversational AI applications.

```python
from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download the model ahead of time
pre_download_namo_turn_v1_model(language="it")

# Initialize Italian turn detector for VideoSDK Agents
turn_detector = NamoTurnDetectorV1(language="it")
```

	> 📚 [Complete Integration Guide](https://docs.videosdk.live/ai_agents/plugins/namo-turn-detector) - Learn how to use `NamoTurnDetectorV1` with VideoSDK Agents

	## 📖 Citation

```bibtex
@misc{namo_turn_detector_it_2025,
  title={Namo Turn Detector v1: Italian},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Italian},
  note={ONNX-optimized DistilBERT for turn detection in Italian}
}
```

	## 📄 License

	This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

	<div align="center">

	Made with ❤️ by the VideoSDK Team

	[![VideoSDK](https://img.shields.io/badge/VideoSDK-Live-blue)](https://videosdk.live)

	</div>