Audio Emotion Recognition (MELD → SEAC, Audio-only)
Overview
This model performs speech emotion recognition from audio only.
It uses a pretrained Wav2Vec2 encoder (frozen) as a feature extractor, followed by a lightweight classification head.
Summary:
- Pretrained on: MELD (English conversational emotions)
- Fine-tuned on: SEAC (Serbian emotional speech)
- Task: 5-class emotion classification from speech audio
Emotions
The model predicts:
- neutral
- joy
- anger
- sadness
- fear
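To illustrate how a 5-way output could be turned into one of these labels, here is a minimal sketch. The index order in `LABELS` is an assumption that simply follows the list above; the card does not state the actual class-to-index mapping, and `decode` is a hypothetical helper name.

```python
import math

# ASSUMPTION: index order mirrors the list in this card; the real mapping is not stated.
LABELS = ["neutral", "joy", "anger", "sadness", "fear"]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```

Check the mapping against the released classifier head before relying on the decoded labels.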
Architecture
- Encoder: facebook/wav2vec2-base (frozen)
- Pooling: Mean pooling over temporal hidden states
- Classifier: Fully connected classification head
- Training strategy: Transfer learning (classifier-only fine-tuning)
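The components listed above can be sketched in PyTorch roughly as follows. This is an illustration, not the released implementation: the class name `MeanPoolClassifier`, the helper `build_model`, and the choice of a single linear layer for the head are assumptions (the card only says "fully connected"), and the sketch pools over all frames without a padding mask.

```python
import torch
from torch import nn


class MeanPoolClassifier(nn.Module):
    """Frozen encoder -> mean pooling over time -> linear head (illustrative)."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768, num_labels: int = 5):
        super().__init__()
        self.encoder = encoder
        # Freeze the encoder so it acts purely as a feature extractor.
        for param in self.encoder.parameters():
            param.requires_grad = False
        # ASSUMPTION: one linear layer; the actual head may be deeper.
        self.head = nn.Linear(hidden_size, num_labels)

    def train(self, mode: bool = True):
        super().train(mode)
        # Keep dropout and masking inside the frozen encoder disabled.
        self.encoder.eval()
        return self

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            hidden = self.encoder(input_values).last_hidden_state  # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)  # utterance-level embedding: (batch, hidden)
        return self.head(pooled)  # logits: (batch, num_labels)


def build_model() -> MeanPoolClassifier:
    # Imported lazily so the class above works without transformers installed.
    from transformers import Wav2Vec2Model

    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    return MeanPoolClassifier(encoder, hidden_size=encoder.config.hidden_size)
```

Calling `build_model()` downloads the encoder from the Hub; the input is expected to be a batch of raw waveforms of shape `(batch, samples)`.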
Transfer Learning Setup
Stage 1: Pretraining (MELD)
- Audio-only emotion classification
Stage 2: Fine-tuning (SEAC)
- Encoder frozen
- Only classification head updated
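A classifier-only update like the one described for Stage 2 might look like the sketch below. The optimizer (AdamW), learning rate, cross-entropy loss, and the helper names `make_head_optimizer` and `train_step` are all assumptions, since the card does not specify them. For brevity, `encoder` here is any module that returns a `(batch, frames, hidden)` tensor directly; with a Hugging Face `Wav2Vec2Model` you would take `.last_hidden_state` from its output instead.

```python
import torch
from torch import nn


def make_head_optimizer(encoder: nn.Module, head: nn.Module, lr: float = 1e-3):
    """Freeze the encoder and return an optimizer over the head parameters only."""
    for param in encoder.parameters():
        param.requires_grad = False
    encoder.eval()
    # ASSUMPTION: AdamW with lr=1e-3; the actual training settings are not published here.
    return torch.optim.AdamW(head.parameters(), lr=lr)


def train_step(encoder, head, optimizer, input_values, labels):
    """Run one update that touches only the classification head."""
    with torch.no_grad():  # no gradients flow into the frozen encoder
        hidden = encoder(input_values)  # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)  # (batch, hidden)
    logits = head(pooled)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the encoder output never changes, the pooled embeddings could also be computed once and cached, which makes head-only training considerably cheaper.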
Evaluation (SEAC Test Set)
| Metric | Score |
|---|---|
| Accuracy | 0.7107 |
| Weighted F1 | 0.7130 |
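For reference, the two metrics in the table can be computed as in this dependency-free sketch. It follows the standard definitions (weighted F1 is the per-class F1 averaged with weights equal to each class's number of true examples); it is not the evaluation script used for the reported scores, and the function names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true-label count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)
```

Weighted F1 accounts for class imbalance in the test set, which is why it can differ slightly from accuracy.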
Notes
- Sampling rate: 16 kHz
- Mean temporal pooling is used to obtain utterance-level embeddings.
- The released weights include only the classification head. The encoder is loaded from facebook/wav2vec2-base.
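Since only the head is released, a head-only checkpoint could be saved and restored along these lines. This is a sketch under stated assumptions: the head is taken to be a single `nn.Linear(768, 5)` (768 is the hidden size of facebook/wav2vec2-base), the checkpoint is assumed to be a plain PyTorch state dict, and `save_head` / `load_head` are hypothetical helpers. The actual file name and format of the released weights are not given in this card.

```python
import os
import tempfile

import torch
from torch import nn

HIDDEN_SIZE = 768  # hidden size of facebook/wav2vec2-base
NUM_LABELS = 5


def save_head(head: nn.Module, path: str) -> None:
    """Store only the classifier parameters, not the encoder."""
    torch.save(head.state_dict(), path)


def load_head(path: str) -> nn.Module:
    """Rebuild the head and load its weights (assumed shape: 768 -> 5)."""
    head = nn.Linear(HIDDEN_SIZE, NUM_LABELS)  # ASSUMPTION: single linear layer
    state = torch.load(path, map_location="cpu")
    head.load_state_dict(state)
    head.eval()
    return head
```

The restored head is then applied to mean-pooled hidden states from the separately loaded encoder, computed on 16 kHz audio.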