Audio Emotion Recognition (MELD → SEAC, Audio-only)
Overview
This model performs speech emotion recognition from audio alone, using a frozen pretrained speech encoder and a lightweight classifier.
The system is trained using cross-dataset transfer learning:
- Pretrained on: MELD (English conversational emotions)
- Fine-tuned on: SEAC (Serbian emotional speech)
- Task: 5-class speech emotion classification
Emotions
The model predicts the following emotions:
- neutral
- joy
- anger
- sadness
- fear
Architecture
- Encoder: facebook/wav2vec2-base (frozen feature extractor)
- Temporal pooling: Mean + Standard Deviation pooling
- Classifier: Fully connected classification head
- Loss: Weighted Cross-Entropy (handles class imbalance)
- Training strategy: Transfer learning (classifier-only fine-tuning)
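The head's input and output shapes can be illustrated with a minimal NumPy sketch. It is not the released code: it assumes the head is a single linear layer, that the mean and standard deviation vectors are concatenated (giving 2 × 768 = 1536 features for wav2vec2-base), and that label indices follow the order listed in this card. The weights here are random placeholders, not the trained checkpoint.

```python
import numpy as np

# Label order is an assumption taken from the list order in this card;
# the checkpoint's real index-to-label mapping may differ.
LABELS = ["neutral", "joy", "anger", "sadness", "fear"]

HIDDEN = 768          # hidden size of wav2vec2-base
POOLED = 2 * HIDDEN   # assumes mean and std vectors are concatenated


def classify(pooled, weight, bias):
    """Apply one linear layer and a softmax to a single pooled vector."""
    logits = pooled @ weight + bias      # shape: (5,)
    logits = logits - logits.max()       # stabilise the exponentials
    probs = np.exp(logits) / np.exp(logits).sum()
    return LABELS[int(probs.argmax())], probs


rng = np.random.default_rng(0)
# Hypothetical, untrained parameters standing in for the real head.
weight = rng.normal(scale=0.01, size=(POOLED, len(LABELS)))
bias = np.zeros(len(LABELS))
# Stand-in for the pooled representation of one utterance.
pooled = rng.normal(size=POOLED)

label, probs = classify(pooled, weight, bias)
```

Because the encoder is frozen, only parameters like `weight` and `bias` would be updated during training in a setup of this kind.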
Temporal Pooling
To obtain stable utterance-level representations, the model applies mean pooling and standard deviation pooling over the encoder's temporal hidden states.
This improves robustness compared to simple mean pooling by capturing both average signal content and temporal variability.
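A minimal NumPy sketch of mean + standard deviation pooling follows. It assumes the two statistics are concatenated into one vector and that padded frames, if any, are dropped via an optional mask. The function name `mean_std_pool` and the mask handling are illustrative, not taken from the released code.

```python
import numpy as np


def mean_std_pool(hidden_states, mask=None):
    """Pool (T, D) frame features into a single (2 * D,) vector.

    mask: optional (T,) boolean array, True for valid (non-padded) frames.
    """
    if mask is not None:
        hidden_states = hidden_states[mask]
    mean = hidden_states.mean(axis=0)
    std = hidden_states.std(axis=0)  # population std (ddof=0)
    return np.concatenate([mean, std])


# Three frames with two features each; the second feature is constant.
frames = np.array([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
pooled = mean_std_pool(frames)

# Same input with the last frame treated as padding.
pooled_masked = mean_std_pool(frames, mask=np.array([True, True, False]))
```

The constant feature yields a standard deviation of zero, which shows how the second half of the vector encodes variability over time that the mean alone would discard.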
Transfer Learning Setup
Stage 1: Pretraining (MELD)
- Audio-only emotion recognition
- Encoder frozen
- Classifier trained on MELD emotional speech
Stage 2: Fine-tuning (SEAC)
- Encoder remains frozen
- Classifier fine-tuned on Serbian speech
- Class-weighted loss used to address imbalance
- Temporal pooling applied
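The class-weighted loss can be sketched in NumPy as below. The card does not say how the weights were derived, so the inverse-frequency heuristic here is an assumption, and both function names are hypothetical. The loss is normalised by the sum of the per-sample weights, which matches the behaviour of PyTorch's `CrossEntropyLoss` when a `weight` tensor is passed with mean reduction.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n_samples / (num_classes * class_count)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # guard against classes with no samples
    return len(labels) / (num_classes * counts)


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample negative log-likelihoods.

    logits: (N, C) floats, labels: (N,) ints, class_weights: (C,) floats.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    w = class_weights[labels]
    return float((w * nll).sum() / w.sum())


# Toy imbalanced label set: class 0 is four times as frequent as the others.
labels = np.array([0, 0, 0, 0, 1, 2, 3, 4])
weights = inverse_frequency_weights(labels, num_classes=5)

# Uniform logits give a loss of log(5) regardless of the weights.
loss = weighted_cross_entropy(np.zeros((8, 5)), labels, weights)
```

With these weights, an error on a rare class costs four times as much as an error on the majority class, which is the mechanism by which a weighted loss counteracts imbalance.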
Evaluation (SEAC Test Set)
| Metric | Score |
|---|---|
| Accuracy | 0.7107 |
| Weighted F1 | 0.7130 |
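For reference, the two metrics can be computed from predicted and true labels as in the sketch below. It assumes the standard definitions (weighted F1 is the per-class F1 averaged with class support as weights). The card does not state which tooling produced the reported scores, and the labels in the example are toy data, not SEAC results.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return float((y_true == y_pred).mean())


def weighted_f1(y_true, y_pred, num_classes):
    """Per-class F1 averaged with each class's support as its weight."""
    total = 0.0
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * np.sum(y_true == c)
    return float(total / len(y_true))


# Toy example with three classes.
y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred, num_classes=3)
```

Weighted F1 is useful next to accuracy on imbalanced test sets because it exposes per-class precision and recall trade-offs that accuracy alone hides.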
Notes
- Sampling rate: 16 kHz
- Encoder weights are loaded from facebook/wav2vec2-base
- The released checkpoint contains only the classification head
- Temporal pooling (mean + std) improves stability over standard mean pooling
- Class-weighted loss improves performance on minority emotions