---
language:
- en
metrics:
- confusion_matrix
- accuracy
base_model:
- openai/whisper-small
pipeline_tag: audio-text-to-text
tags:
- Audio
- ASR
- Speech-to-text
- Text-to-sentimentClassification
license: cc-by-4.0
datasets:
- InfoBayAI/call_center_audio_dual_channel_en_in
- InfoBayAI/English-Podcast-ASR-Dataset
- InfoBayAI/Hindi-Podcast-ASR-Dataset
- InfoBayAI/call_center_audio_dual_channel_en_uk
---

**Model Description**

This model is a transformer-based sentiment classification system built using **DistilBERT** and trained on text data derived from the [InfoBay.AI](https://huggingface.co/collections/InfoBayAI/podcast-speech-and-conversational-audio-datasets) audio dataset. The training pipeline converts raw conversational audio into structured text using **Whisper base**, followed by segmentation and sentiment labeling. The resulting text dataset is then used to train the sentiment classification model.

This approach enables the transformation of unstructured audio data into meaningful NLP intelligence, demonstrating the value of the dataset for downstream AI applications.

![infobay_pipeline](https://cdn-uploads.huggingface.co/production/uploads/693ab313ff1770594f99afee/gtuUnCIpOKLDtc1QBILuR.png)

**Training Pipeline**

The complete pipeline used for training is as follows:

**Raw Audio (InfoBay.AI Dataset) → Whisper ASR (Speech-to-Text) → Text Segmentation → Sentiment Labeling → DistilBERT Training**

- Audio Source: InfoBay.AI podcast dataset
- Transcription: Whisper base model
- Data Processing: Sentence-level segmentation
- Labeling: VADER-based sentiment scoring
- Model Training: DistilBERT for 3-class sentiment classification

**Key Insight**

This model demonstrates that audio data alone can be converted into high-quality training data and used effectively to train transformer-based NLP models.
It validates the ability of the [InfoBay.AI](https://infobay.ai/) dataset to support:

- Speech-to-text pipelines
- Sentiment analysis systems
- End-to-end conversational AI workflows

**Dataset Split**

- Train/Test Split: 80% / 20%
- Split Strategy: Stratified sampling (to preserve class distribution)
- Label Encoding: Applied using LabelEncoder

**Training Hyperparameters**

- Number of Epochs: 15
- Train Batch Size: 16
- Evaluation Batch Size: 16
- Learning Rate: 2e-5
- Optimizer: AdamW
- Loss Function: Cross-Entropy Loss
- Logging Directory: ./logs
- Output Directory: ./results

**Model Performance**

In internal evaluation on the speech-derived dataset, the model performs strongly:

- Accuracy: ~98%
- Macro F1-score: ~0.98
- Weighted F1-score: ~0.99

**Classification Report**

| Class | Sentiment | Precision | Recall | F1-score | Support |
| ----- | --------- | --------- | ------ | -------- | ------- |
| 0 | Negative | 0.97 | 0.96 | 0.96 | 1,128 |
| 1 | Neutral | 0.99 | 0.99 | 0.99 | 7,865 |
| 2 | Positive | 0.98 | 0.98 | 0.98 | 2,658 |

---

**Evaluation Results**

Speech recognition quality was evaluated using standard metrics:

- Word Error Rate (WER): 9.172%
- Character Error Rate (CER): 4.53%

---

**Usage**

Install dependencies

```bash
pip install -U transformers torch
```

```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch
import torch.nn.functional as F

repo_id = "InfoBayAI/Audio-to-Sentiment_Intelligence_Model"

# Load the tokenizer and classifier from the "sentiment-model" subfolder of the repo
tokenizer = DistilBertTokenizer.from_pretrained(
    repo_id,
    subfolder="sentiment-model"
)
model = DistilBertForSequenceClassification.from_pretrained(
    repo_id,
    subfolder="sentiment-model"
)
model.eval()

text = " Write your text "
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the most likely class
probs = F.softmax(outputs.logits, dim=1)
predicted_class = torch.argmax(probs, dim=1).item()

# Class indices follow the classification report: 0, 1, 2
labels = ["Negative", "Neutral", "Positive"]

print("Text:", text)
print("Prediction:", labels[predicted_class])
print("Confidence:", probs[0][predicted_class].item())
```

**AUDIO-TO-TEXT**

```python
# Additional dependencies: pip install openai-whisper pandas vaderSentiment
import os

import pandas as pd
import whisper
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

model = whisper.load_model("base")

audio_folder = r"C:\Users\3\Documents\AUDIO 2\b6"
print(os.path.exists(audio_folder))  # sanity check that the folder exists

analyzer = SentimentIntensityAnalyzer()
data = []
sr = 1  # running serial number across all segments

# Loop through all audio files
for file in os.listdir(audio_folder):
    if file.endswith((".wav", ".mp3")):
        audio_path = os.path.join(audio_folder, file)
        print("Processing:", file)

        # task="translate" makes Whisper output English text; fp16=False runs in FP32
        result = model.transcribe(audio_path, task="translate", fp16=False)

        segment_id = 1
        for segment in result["segments"]:
            text = segment["text"]

            # Sentiment score
            sentiment_score = analyzer.polarity_scores(text)["compound"]

            # Convert score to label
            if sentiment_score > 0.05:
                sentiment = "positive"
            elif sentiment_score < -0.05:
                sentiment = "negative"
            else:
                sentiment = "neutral"

            data.append({
                "sr_no": sr,
                "call_id": file,
                "segment_id": segment_id,
                "start_time": segment["start"],
                "end_time": segment["end"],
                "text": text,
                "sentiment": sentiment
            })
            sr += 1
            segment_id += 1

# Write all labeled segments to a single CSV-formatted file
df = pd.DataFrame(data)
df.to_csv("AUDIO", index=False)
print("dataset created")
print(df.head())
```

---

**Considerations**

This model is trained on text derived from the InfoBay.AI audio dataset and is provided for research and evaluation purposes. The full dataset contains a larger collection of high-quality conversational audio.

For access to the full dataset or enterprise licensing inquiries, please visit our website [InfoBay.AI](https://infobay.ai/) or contact us directly.

Ph: (91) 8303174762

Email: datareq@infobay.ai
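
**Computing WER and CER (illustrative)**

The Word Error Rate and Character Error Rate listed under Evaluation Results are conventionally defined as an edit distance divided by the length of the reference. The following is a minimal sketch under that standard definition. The function names (`edit_distance`, `word_error_rate`, `char_error_rate`) and the sample sentences are hypothetical, and this is not necessarily the script used to produce the reported figures.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example: one inserted word against a five-word reference
reference = "thank you for calling support"
hypothesis = "thank you for calling the support"
print("WER:", word_error_rate(reference, hypothesis))
print("CER:", char_error_rate("hello", "hallo"))
```

Both functions share one edit-distance routine, so the only difference between WER and CER here is whether the inputs are split into words or characters. Real evaluations usually normalize casing and punctuation before scoring, which this sketch leaves out.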