---

license: mit
library_name: pytorch
pipeline_tag: audio-classification
language:
  - sr
  - en
datasets:
  - declare-lab/meld
  - seac
metrics:
  - accuracy
  - weighted-f1
tags:
  - emotion-recognition
  - speech-emotion-recognition
  - audio
  - wav2vec2
  - transfer-learning
  - meld
  - seac
---


# Audio Emotion Recognition (MELD → SEAC, Audio-only)

## Overview

This model performs **speech emotion recognition from audio only**.

It uses a **pretrained Wav2Vec2 encoder (frozen)** as a feature extractor,
followed by a lightweight classification head.

Training data and task:

- **Pretrained on:** MELD (English conversational emotions)
- **Fine-tuned on:** SEAC (Serbian emotional speech)
- **Task:** 5-class emotion classification from speech audio

---

## Emotions

The model predicts one of five emotion classes:

- neutral
- joy
- anger
- sadness
- fear

---

## Architecture

- **Encoder:** `facebook/wav2vec2-base` (frozen)
- **Pooling:** Mean pooling over temporal hidden states
- **Classifier:** Fully connected classification head
- **Training strategy:** Transfer learning (classifier-only fine-tuning)
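
The components listed above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the released code: the class name `MeanPoolClassifier`, the helper `load_frozen_encoder`, and the choice of a single linear layer for the head are assumptions, since the card does not specify the head's layer structure. The hidden size of 768 is that of `facebook/wav2vec2-base`.

```python
import torch
from torch import nn

NUM_CLASSES = 5      # neutral, joy, anger, sadness, fear
HIDDEN_SIZE = 768    # hidden size of facebook/wav2vec2-base


class MeanPoolClassifier(nn.Module):
    """Hypothetical head: mean-pool over time, then one linear layer (assumed)."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_size)
        pooled = hidden_states.mean(dim=1)   # (batch, hidden_size)
        return self.fc(pooled)               # (batch, num_classes)


def load_frozen_encoder():
    """Load facebook/wav2vec2-base and freeze it.

    Needs the `transformers` package and network access, so it is not
    called in the shape check below.
    """
    from transformers import Wav2Vec2Model

    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    encoder.eval()
    for param in encoder.parameters():
        param.requires_grad = False
    return encoder


# Shape check with random stand-in features instead of real encoder output.
head = MeanPoolClassifier()
fake_hidden = torch.randn(2, 49, HIDDEN_SIZE)   # 2 utterances, 49 frames each
logits = head(fake_hidden)                      # (2, 5)
```

With real audio, `load_frozen_encoder()(waveform).last_hidden_state` would take the place of `fake_hidden`, where `waveform` is a float tensor of shape `(batch, samples)` sampled at 16 kHz.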

---

## Transfer Learning Setup

**Stage 1 – Pretraining (MELD)**
- Audio-only emotion classification

**Stage 2 – Fine-tuning (SEAC)**
- Encoder frozen
- Only classification head updated
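
The classifier-only update in Stage 2 can be illustrated with a small, self-contained sketch. To keep it runnable without downloads, a tiny linear layer stands in for the Wav2Vec2 encoder; the layer sizes, optimizer, and learning rate are all assumptions chosen for illustration, not the settings used for this model.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-ins (hypothetical): a tiny "encoder" and a 5-class head.
encoder = nn.Linear(16, 8)
head = nn.Linear(8, 5)

# Freeze the encoder so that no gradients are computed for it.
encoder.eval()
for param in encoder.parameters():
    param.requires_grad = False

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

frames = torch.randn(4, 10, 16)        # (batch, frames, features)
labels = torch.tensor([0, 1, 2, 3])    # one class index per utterance

encoder_before = encoder.weight.clone()
head_before = head.weight.clone()

with torch.no_grad():                  # encoder acts as a fixed feature extractor
    hidden = encoder(frames)           # (4, 10, 8)
pooled = hidden.mean(dim=1)            # mean pooling over time -> (4, 8)

loss = loss_fn(head(pooled), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the encoder weights are unchanged while the head weights have moved, which is the behavior the two bullets above describe.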

---

## Evaluation (SEAC Test Set)

| Metric        | Score |
|---------------|-------|
| Accuracy      | **0.7107** |
| Weighted F1   | **0.7130** |
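
For reference, the two metrics can be computed as in the sketch below. It assumes the standard support-weighted definition of weighted F1 (per-class F1 averaged with weights proportional to the number of true examples per class); the card does not state which implementation produced the scores above, and the toy labels here are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Toy example (illustrative labels only, not SEAC data).
y_true = ["neutral", "neutral", "joy", "anger"]
y_pred = ["neutral", "joy", "joy", "anger"]

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```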

---

## Notes

- Sampling rate: 16 kHz
- Mean temporal pooling is used to obtain utterance-level embeddings.
- The released weights contain only the classification head.
  The frozen encoder must be loaded separately from `facebook/wav2vec2-base`.
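
A head-only checkpoint of this kind can be produced and restored as sketched below. The module layout (`encoder` and `head` attributes), the `head.` key prefix, and the stand-in layers are assumptions for illustration; the actual key names in the released file may differ. An in-memory buffer replaces the checkpoint file so the example runs without any download.

```python
import io

import torch
from torch import nn


class EmotionModel(nn.Module):
    """Hypothetical wrapper: `encoder` stands in for Wav2Vec2, `head` for the classifier."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 8)   # stand-in for facebook/wav2vec2-base
        self.head = nn.Linear(8, 5)       # 5 emotion classes


model = EmotionModel()

# Keep only the classifier parameters when saving.
head_state = {k: v for k, v in model.state_dict().items() if k.startswith("head.")}

buffer = io.BytesIO()                     # in-memory stand-in for a checkpoint file
torch.save(head_state, buffer)
buffer.seek(0)

# Restore into a fresh model; encoder weights would come from the pretrained
# checkpoint, so they are expected to be reported as missing here.
restored = EmotionModel()
result = restored.load_state_dict(torch.load(buffer), strict=False)
```

Passing `strict=False` lets `load_state_dict` accept a checkpoint that omits the encoder; `result.missing_keys` then lists the encoder parameters that have to be supplied from the pretrained model.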

---