Audio Emotion Recognition (MELD → SEAC, Audio-only)

Overview

This model performs speech emotion recognition from audio only using a frozen pretrained speech encoder and a lightweight classifier.

The system is trained using cross-dataset transfer learning:

  • Pretrained on: MELD (English conversational emotions)
  • Fine-tuned on: SEAC (Serbian emotional speech)
  • Task: 5-class speech emotion classification

Emotions

The model predicts the following emotions:

  • neutral
  • joy
  • anger
  • sadness
  • fear

Architecture

  • Encoder: facebook/wav2vec2-base (frozen feature extractor)
  • Temporal pooling: Mean + Standard Deviation pooling
  • Classifier: Fully connected classification head
  • Loss: Weighted Cross-Entropy (handles class imbalance)
  • Training strategy: Transfer learning (classifier-only fine-tuning)
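As a rough illustration of how the components above fit together at the classifier end, here is a minimal NumPy sketch. It makes several assumptions the card does not state: the head is a single linear layer, the mean and std vectors are concatenated (giving 2 × 768 = 1536 inputs for wav2vec2-base), and the label order follows the Emotions list. The weights are random stand-ins, not the released checkpoint.

```python
import numpy as np

# Assumed label order; the actual index-to-label mapping is not stated in the card.
LABELS = ["neutral", "joy", "anger", "sadness", "fear"]
HIDDEN_SIZE = 768              # hidden size of facebook/wav2vec2-base
POOLED_DIM = 2 * HIDDEN_SIZE   # assumes mean and std vectors are concatenated


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def classify(pooled, weight, bias):
    """Apply a single linear layer (hypothetical head); return label and probabilities."""
    logits = pooled @ weight + bias   # (POOLED_DIM,) @ (POOLED_DIM, 5) -> (5,)
    probs = softmax(logits)
    return LABELS[int(np.argmax(probs))], probs


# Random stand-ins for the trained head parameters and a pooled utterance vector.
rng = np.random.default_rng(0)
weight = rng.normal(scale=0.01, size=(POOLED_DIM, len(LABELS)))
bias = np.zeros(len(LABELS))
pooled = rng.normal(size=POOLED_DIM)

label, probs = classify(pooled, weight, bias)
print(label, probs.round(3))
```

In a real pipeline the `pooled` vector would come from the frozen encoder's hidden states rather than a random generator.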

Temporal Pooling

To obtain a stable utterance-level representation, the model applies mean pooling and standard deviation pooling over the encoder's temporal hidden states.

Compared with mean pooling alone, this captures both the average signal content and its temporal variability, which improves robustness.
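The pooling step can be sketched as follows. This is a minimal NumPy version under stated assumptions: the two statistics are concatenated, the population (not sample) standard deviation is used, and an optional padding mask excludes padded frames. The function name `mean_std_pool` and the `eps` parameter are illustrative, not taken from the released code.

```python
import numpy as np


def mean_std_pool(hidden, mask=None, eps=1e-5):
    """Pool (T, D) frame features into a (2*D,) vector of [mean, std].

    mask is an optional (T,) array with 1 for real frames and 0 for padding.
    """
    hidden = np.asarray(hidden, dtype=np.float64)
    if mask is None:
        mask = np.ones(hidden.shape[0])
    mask = np.asarray(mask, dtype=np.float64)[:, None]   # (T, 1) for broadcasting
    n_frames = max(float(mask.sum()), 1.0)               # avoid division by zero

    mean = (hidden * mask).sum(axis=0) / n_frames
    var = (((hidden - mean) ** 2) * mask).sum(axis=0) / n_frames
    std = np.sqrt(var + eps)                             # eps keeps the sqrt stable
    return np.concatenate([mean, std])


# Two real frames of 2-dim features plus one padded frame that must be ignored.
frames = np.array([[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]])
pooled = mean_std_pool(frames, mask=[1, 1, 0], eps=0.0)
print(pooled)   # mean is [2, 4] and std is [1, 2] over the two real frames
```

The std half of the vector is what distinguishes a flat utterance from an expressive one with the same average features.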


Transfer Learning Setup

Stage 1: Pretraining (MELD)

  • Audio-only emotion recognition
  • Encoder frozen
  • Classifier trained on MELD emotional speech

Stage 2: Fine-tuning (SEAC)

  • Encoder remains frozen
  • Classifier fine-tuned on Serbian speech
  • Class-weighted loss used to address imbalance
  • Temporal pooling applied
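The class-weighted loss used in Stage 2 could look like the sketch below. The card does not say how the weights were derived, so inverse-frequency weighting is an assumption here, and the function names and toy labels are hypothetical. The loss is normalized by the summed weights of the targets, which matches how PyTorch's `CrossEntropyLoss` behaves with a `weight` argument and mean reduction.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by N / (num_classes * count), so rare classes weigh more."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    counts = np.maximum(counts, 1.0)   # guard against classes with no samples
    return len(labels) / (num_classes * counts)


def weighted_cross_entropy(logits, targets, class_weights):
    """Class-weighted cross-entropy, normalized by the summed target weights."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]   # per-sample negative log-likelihood
    w = class_weights[targets]                           # weight of each sample's true class
    return float((w * nll).sum() / w.sum())


# Toy imbalanced label set (not SEAC data): class 0 is frequent, classes 2 and 4 are rare.
train_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 3, 4])
weights = inverse_frequency_weights(train_labels, num_classes=5)

logits = np.zeros((3, 5))          # uniform predictions for three samples
targets = np.array([0, 4, 2])
loss = weighted_cross_entropy(logits, targets, weights)
print(weights.round(2), round(loss, 4))
```

With uniform logits every sample has the same negative log-likelihood, ln 5, so the weighted average is also ln 5. The weights only change the loss once predictions differ across classes.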

Evaluation (SEAC Test Set)

Metric        Score
Accuracy      0.7107
Weighted F1   0.7130
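For reference, the two reported metrics can be computed as in this sketch. It uses the common definition of weighted F1 (per-class F1 averaged by class support); the card does not name the evaluation library, so this is an assumption. The labels below are toy values, not SEAC predictions.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def weighted_f1(y_true, y_pred, num_classes):
    """Per-class F1 averaged with weights proportional to class support."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for c in range(num_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0   # F1 = 2TP / (2TP + FP + FN)
        support = np.sum(y_true == c)               # number of true samples in class c
        total += f1 * support
    return float(total / len(y_true))


# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred, num_classes=3)
print(round(acc, 4), round(wf1, 4))
```

Because the average is weighted by support, frequent classes dominate the score; a macro-averaged F1 would give each emotion equal weight.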

Notes

  • Sampling rate: 16 kHz
  • Encoder weights are loaded from facebook/wav2vec2-base
  • The released checkpoint contains only the classification head
  • Temporal pooling (mean + std) improves stability over standard mean pooling
  • Class-weighted loss improves performance on minority emotions
