---

license: mit
library_name: pytorch
pipeline_tag: audio-classification
language:
  - sr
  - en
datasets:
  - declare-lab/meld
  - seac
metrics:
  - accuracy
  - weighted-f1
tags:
  - emotion-recognition
  - speech-emotion-recognition
  - audio
  - wav2vec2
  - transfer-learning
  - meld
  - seac
---


# Audio Emotion Recognition (MELD → SEAC, Audio-only)

## Overview

This model performs **speech emotion recognition from audio only**.

It uses a **pretrained Wav2Vec2 encoder (frozen)** as a feature extractor,
followed by a lightweight classification head.

Training and task summary:

- **Pretrained on:** MELD (English conversational emotions)
- **Fine-tuned on:** SEAC (Serbian emotional speech)
- **Task:** 5-class emotion classification from speech audio

---

## Emotions

The model predicts:

- neutral
- joy
- anger
- sadness
- fear
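
To turn the head's five output scores into one of the labels above, a mapping from class index to label is needed. The sketch below assumes the index order matches the order of the list above; that order is an assumption and is not confirmed by this card. `decode` is a hypothetical helper name.

```python
# Index order is ASSUMED to follow the list in this card; verify against
# the label mapping used during training before relying on it.
LABELS = ["neutral", "joy", "anger", "sadness", "fear"]


def decode(logits):
    """Return the label whose score is highest in a sequence of 5 logits."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(logits)}")
    best = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[best]
```

For example, `decode([0.1, 2.0, -1.0, 0.3, 0.0])` picks index 1, which is `"joy"` under the assumed ordering.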

---

## Architecture

- **Encoder:** `facebook/wav2vec2-base` (frozen)
- **Pooling:** Mean pooling over temporal hidden states
- **Classifier:** Fully connected classification head
- **Training strategy:** Transfer learning (classifier-only fine-tuning)
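
The components listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name `EmotionHead`, the single hidden layer, its width (256) and the dropout rate are all assumptions, since the card only states that the head is fully connected. The input size of 768 is the hidden size of `facebook/wav2vec2-base`.

```python
import torch
from torch import nn

HIDDEN_SIZE = 768  # hidden size of facebook/wav2vec2-base
NUM_CLASSES = 5    # neutral, joy, anger, sadness, fear


class EmotionHead(nn.Module):
    """Hypothetical head: mean pooling over time, then a small MLP."""

    def __init__(self, hidden_size=HIDDEN_SIZE, num_classes=NUM_CLASSES,
                 inner=256, dropout=0.1):
        super().__init__()
        # Layer sizes are assumptions; the card does not specify them.
        self.net = nn.Sequential(
            nn.Linear(hidden_size, inner),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(inner, num_classes),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_size) from the frozen encoder
        pooled = hidden_states.mean(dim=1)  # (batch, hidden_size)
        return self.net(pooled)             # (batch, num_classes) logits
```

Feeding a tensor of shape `(2, 49, 768)` through `EmotionHead()` yields logits of shape `(2, 5)`, one row per utterance.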

---

## Transfer Learning Setup

**Stage 1 – Pretraining (MELD)**
- Audio-only emotion classification

**Stage 2 – Fine-tuning (SEAC)**
- Encoder frozen
- Only classification head updated
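
Classifier-only fine-tuning as described in Stage 2 could look roughly like the sketch below. To keep it runnable without downloading weights, a small stand-in module plays the role of the Wav2Vec2 encoder; the optimizer, learning rate and the `train_step` helper are assumptions, not the settings used for this model.

```python
import torch
from torch import nn

# Stand-in for the pretrained encoder (ASSUMPTION, for illustration only).
encoder = nn.Sequential(nn.Linear(16, 768), nn.GELU())
head = nn.Linear(768, 5)  # simplified classification head

# Freeze the encoder: no gradients, and eval mode to disable dropout etc.
for param in encoder.parameters():
    param.requires_grad = False
encoder.eval()

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)


def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    """Run one update of the head on a batch; the encoder stays fixed."""
    with torch.no_grad():
        hidden = encoder(features)      # (batch, frames, 768)
    logits = head(hidden.mean(dim=1))   # mean pooling over frames
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the encoder output is computed under `torch.no_grad()`, no activations are stored for the encoder, which keeps memory use low during fine-tuning.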

---

## Evaluation (SEAC Test Set)

| Metric        | Score |
|---------------|-------|
| Accuracy      | **0.7107** |
| Weighted F1   | **0.7130** |
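
For reference, the two metrics in the table can be computed as in the sketch below. Weighted F1 is the per-class F1 averaged with weights proportional to each class's number of true samples. The function names are illustrative, and the evaluation script actually used for the reported scores is not part of this card.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)
```

Weighting by support means that frequent classes dominate the score, which is why weighted F1 and accuracy tend to be close on imbalanced test sets.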

---

## Notes

- Sampling rate: 16 kHz
- Mean temporal pooling is used to obtain utterance-level embeddings.
- The released weights include only the classification head.
  The encoder is loaded from `facebook/wav2vec2-base`.
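
Since only the head is released, utterance embeddings have to be produced by loading the encoder separately. The sketch below shows one way this might be done with the `transformers` library. The helper names `prepare_waveform` and `embed_utterance`, the `(channels, samples)` input layout, and the zero-mean/unit-variance normalisation are assumptions about the preprocessing, not details confirmed by this card.

```python
import torch

TARGET_SR = 16_000  # sampling rate stated in the notes above


def prepare_waveform(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Return a mono float waveform of shape (1, num_samples) at 16 kHz."""
    if sample_rate != TARGET_SR:
        raise ValueError(f"expected {TARGET_SR} Hz audio, got {sample_rate} Hz")
    if waveform.dim() == 1:
        waveform = waveform.unsqueeze(0)
    # Average channels to mono; layout ASSUMED to be (channels, samples).
    return waveform.float().mean(dim=0, keepdim=True)


def embed_utterance(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Encode one utterance with the frozen encoder and mean-pool over time."""
    # Imported lazily because loading the encoder downloads its weights.
    from transformers import Wav2Vec2Model

    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
    mono = prepare_waveform(waveform, sample_rate)
    # ASSUMPTION: inputs are normalised to zero mean and unit variance.
    mono = (mono - mono.mean()) / (mono.std() + 1e-7)
    with torch.no_grad():
        hidden = encoder(mono).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1)                     # (1, 768)
```

The resulting `(1, 768)` embedding is what the classification head would consume; audio recorded at another rate needs to be resampled to 16 kHz before calling `embed_utterance`.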

---