| ---
|
| license: mit
|
| library_name: pytorch
|
| pipeline_tag: audio-classification
|
| language:
|
| - sr
|
| - en
|
| datasets:
|
| - declare-lab/meld
|
| - seac
|
| metrics:
|
| - accuracy
|
| - weighted-f1
|
| tags:
|
| - emotion-recognition
|
| - speech-emotion-recognition
|
| - audio
|
| - wav2vec2
|
| - transfer-learning
|
| - meld
|
| - seac
|
| ---
|
|
|
| # Audio Emotion Recognition (MELD → SEAC, Audio-only)
|
|
|
| ## Overview
|
|
|
| This model performs **speech emotion recognition from audio only**.
|
|
|
| It uses a **frozen, pretrained Wav2Vec2 encoder** as a feature extractor, followed by a lightweight classification head.
|
|
|
| Training and task summary:
|
|
|
| - **Pretrained on:** MELD (English conversational emotions)
|
| - **Fine-tuned on:** SEAC (Serbian emotional speech)
|
| - **Task:** 5-class emotion classification from speech audio
|
|
|
| ---
|
|
|
| ## Emotions
|
|
|
| The model predicts:
|
|
|
| - neutral
|
| - joy
|
| - anger
|
| - sadness
|
| - fear
|
|
|
| ---
|
|
|
| ## Architecture
|
|
|
| - **Encoder:** `facebook/wav2vec2-base` (frozen)
|
| - **Pooling:** Mean pooling over temporal hidden states
|
| - **Classifier:** Fully connected classification head
|
| - **Training strategy:** Transfer learning (classifier-only fine-tuning)
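The pooling and classifier steps listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name `MeanPoolClassifier`, the single linear layer, and the optional padding mask are assumptions, since the card only specifies mean pooling and a fully connected head. The hidden size of 768 is that of `facebook/wav2vec2-base`.

```python
from typing import Optional

import torch
from torch import nn

HIDDEN_SIZE = 768  # hidden size of facebook/wav2vec2-base
NUM_CLASSES = 5    # neutral, joy, anger, sadness, fear


class MeanPoolClassifier(nn.Module):
    """Mean-pools encoder hidden states over time, then classifies.

    Assumption: one linear layer. The card only says
    "fully connected classification head".
    """

    def __init__(self, hidden_size: int = HIDDEN_SIZE, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(
        self,
        hidden_states: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_size)
        # mask: optional (batch, frames), 1 for real frames and 0 for padding
        if mask is None:
            pooled = hidden_states.mean(dim=1)
        else:
            weights = mask.unsqueeze(-1).to(hidden_states.dtype)
            summed = (hidden_states * weights).sum(dim=1)
            counts = weights.sum(dim=1).clamp(min=1.0)  # avoid division by zero
            pooled = summed / counts
        return self.fc(pooled)  # (batch, num_classes) logits


# Random tensor standing in for the encoder's last hidden state.
hidden = torch.randn(2, 49, HIDDEN_SIZE)
logits = MeanPoolClassifier()(hidden)  # shape: (2, 5)
```

The mask only matters when utterances of different lengths are padded into one batch; for single utterances plain `mean(dim=1)` is equivalent.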
|
|
|
| ---
|
|
|
| ## Transfer Learning Setup
|
|
|
| **Stage 1 – Pretraining (MELD)**
|
| - Audio-only emotion classification
|
|
|
| **Stage 2 – Fine-tuning (SEAC)**
|
| - Encoder frozen
|
| - Only classification head updated
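Stage 2 as described above (encoder frozen, only the head updated) might look like the sketch below. The helper names `freeze` and `train_step`, the optimizer, and the learning rate are illustrative assumptions. A small linear layer stands in for the encoder so the snippet runs offline; with the real `Wav2Vec2Model` the hidden states would be read from `.last_hidden_state`.

```python
import torch
from torch import nn


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients for every parameter and switch to eval mode."""
    for param in module.parameters():
        param.requires_grad = False
    return module.eval()


def train_step(encoder, head, optimizer, inputs, labels) -> float:
    """One update of the classification head on top of a frozen encoder."""
    with torch.no_grad():              # no graph is built for the encoder
        hidden = encoder(inputs)       # (batch, frames, hidden_size)
    logits = head(hidden.mean(dim=1))  # mean pooling, then classification
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Placeholder encoder (assumption) in place of
# Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").
encoder = freeze(nn.Linear(16, 768))
head = nn.Linear(768, 5)  # assumed head shape: 768 features -> 5 emotions

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # illustrative values

dummy_inputs = torch.randn(4, 10, 16)     # (batch, frames, input_dim)
dummy_labels = torch.tensor([0, 1, 2, 3])  # emotion ids
loss = train_step(encoder, head, optimizer, dummy_inputs, dummy_labels)
```

Because the encoder never changes, its pooled embeddings could also be computed once and cached, which is one practical reason to train only the head.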
|
|
|
| ---
|
|
|
| ## Evaluation (SEAC Test Set)
|
|
|
| | Metric | Score |
|
| |---------------|-------|
|
| | Accuracy | **0.7107** |
|
| | Weighted F1 | **0.7130** |
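For readers unfamiliar with the two metrics in the table, here is a small dependency-free sketch of how they are commonly computed. It assumes "weighted F1" means per-class F1 averaged by class support (the same definition as scikit-learn's `average="weighted"`); the card does not state which implementation was used. The toy labels are illustrative only and are not SEAC predictions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted label matches the reference."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Toy example (made-up labels, not model output).
y_true = ["neutral", "neutral", "joy", "anger", "sadness", "fear"]
y_pred = ["neutral", "joy", "joy", "anger", "sadness", "neutral"]
print(f"accuracy={accuracy(y_true, y_pred):.4f}")
print(f"weighted_f1={weighted_f1(y_true, y_pred):.4f}")
```

Weighting by support means frequent classes dominate the score, so the two metrics tend to be close when class frequencies are uneven, as they are in the table.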
|
|
|
| ---
|
|
|
| ## Notes
|
|
|
| - Input sampling rate: 16 kHz (resample other audio before inference).
|
| - Mean temporal pooling is used to obtain utterance-level embeddings.
|
| - The released weights include only the classification head; the encoder must be loaded separately from `facebook/wav2vec2-base`.
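Since only the head is released, inference needs both pieces assembled by hand. The sketch below shows one way this could be done. Several parts are assumptions: the function names, the single-linear-layer head, the checkpoint being a plain `state_dict`, the label index order, and the input normalization (which mirrors the default behaviour of `Wav2Vec2FeatureExtractor`). The file name in the usage comment is hypothetical.

```python
import torch
from torch import nn

SAMPLE_RATE = 16_000
# Index order is an assumption; check it against the training label map.
LABELS = ["neutral", "joy", "anger", "sadness", "fear"]


def load_encoder() -> nn.Module:
    """Fetch the encoder from the Hub (needs `transformers` and network access)."""
    from transformers import Wav2Vec2Model

    return Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()


def load_head(path: str) -> nn.Module:
    """Rebuild the head and load its weights.

    Assumptions: a single linear layer and a checkpoint that is a plain
    state_dict. Adapt both to the released file.
    """
    head = nn.Linear(768, len(LABELS))
    head.load_state_dict(torch.load(path, map_location="cpu"))
    return head.eval()


@torch.no_grad()
def predict(encoder, head, waveform: torch.Tensor) -> str:
    """Classify one mono waveform of shape (num_samples,), already at 16 kHz."""
    # Zero-mean, unit-variance scaling of the raw samples.
    waveform = (waveform - waveform.mean()) / torch.sqrt(
        waveform.var(unbiased=False) + 1e-7
    )
    hidden = encoder(waveform.unsqueeze(0)).last_hidden_state  # (1, frames, 768)
    logits = head(hidden.mean(dim=1))                          # mean temporal pooling
    return LABELS[int(logits.argmax(dim=-1))]


# Usage (hypothetical file name):
# encoder = load_encoder()
# head = load_head("classifier_head.pt")
# print(predict(encoder, head, waveform_16k))
```

Audio recorded at another rate should be resampled to 16 kHz before calling `predict`, since the encoder was pretrained on 16 kHz speech.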
|
|
|
| --- |