PetraMicanovic committed on
Commit 1b96e79 · 1 Parent(s): 2ded58d

Change README.md

Files changed (1)
  1. README.md +29 -39
README.md CHANGED
@@ -6,11 +6,11 @@ language:
  - sr
  - en
  datasets:
- - meld
  - seac
  metrics:
  - accuracy
- - f1
  tags:
  - emotion-recognition
  - speech-emotion-recognition
@@ -19,26 +19,28 @@ tags:
  - transfer-learning
  - meld
  - seac
-
  ---
  # Audio Emotion Recognition (MELD → SEAC, Audio-only)

  ## Overview

- This model performs **speech emotion recognition from audio only**.
- It is based on a **pretrained Wav2Vec2 encoder (frozen)** with a lightweight audio classification head.

  The model was:

- - **Pretrained on:** MELD dataset (English, conversational emotions)
- - **Fine-tuned on:** SEAC dataset (Serbian emotional speech)
  - **Task:** 5-class emotion classification from speech audio

  ---

  ## Emotions

- The model predicts the following emotions:

  - neutral
  - joy
@@ -50,50 +52,38 @@ The model predicts the following emotions:

  ## Architecture

- - **Encoder:** Wav2Vec2 (frozen, used as feature extractor)
- - **Pooling:** Mean pooling over hidden states
- - **Classifier:** Fully connected audio emotion head
- - **Loss:** Class-weighted CrossEntropy (handles class imbalance)
- - **Optimizer:** AdamW
- - **LR Scheduler:** ReduceLROnPlateau
- - **Early stopping:** Enabled

  ---

  ## Transfer Learning Setup

- The training followed a **cross-dataset transfer learning** setup:

- **Step 1 Pretraining**
- - Model trained on MELD (audio-only)
-
- **Step 2 — Fine-tuning**
- - Model adapted to SEAC Serbian emotional speech
- - Encoder kept frozen
- - Only classification head trained

  ---

  ## Evaluation (SEAC Test Set)

- | Metric | Score |
- |--------|-------|
- | Accuracy | **0.7107** |
- | Weighted F1 | **0.7130** |
-
- ### Per-class behavior
-
- - Best recognized: **fear, neutral**
- - Good performance: **joy, sadness**
- - Hardest class: **anger** (confused mostly with fear)

  ---

- ## Usage

- ```python
- import torch

- model.load_state_dict(torch.load("audio_model.pt", map_location="cpu"))
- model.eval()
- ```

@@ -6,11 +6,11 @@ language:
  - sr
  - en
  datasets:
+ - declare-lab/meld
  - seac
  metrics:
  - accuracy
+ - weighted-f1
  tags:
  - emotion-recognition
  - speech-emotion-recognition
@@ -19,26 +19,28 @@ tags:
  - transfer-learning
  - meld
  - seac
  ---
+
  # Audio Emotion Recognition (MELD → SEAC, Audio-only)

  ## Overview

+ This model performs **speech emotion recognition from audio only**.
+
+ It uses a **pretrained Wav2Vec2 encoder (frozen)** as a feature extractor,
+ followed by a lightweight classification head.

  The model was:

+ - **Pretrained on:** MELD (English conversational emotions)
+ - **Fine-tuned on:** SEAC (Serbian emotional speech)
  - **Task:** 5-class emotion classification from speech audio

  ---

  ## Emotions

+ The model predicts:

  - neutral
  - joy
 
@@ -50,50 +52,38 @@ The model predicts the following emotions:

  ## Architecture

+ - **Encoder:** `facebook/wav2vec2-base` (frozen)
+ - **Pooling:** Mean pooling over temporal hidden states
+ - **Classifier:** Fully connected classification head
+ - **Training strategy:** Transfer learning (classifier-only fine-tuning)
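
As a rough illustration of the encoder, pooling, and classifier path listed in the Architecture section, the sketch below mean-pools a batch of frame-level hidden states and applies a single linear layer. It is not the repository's code. NumPy stands in for PyTorch to keep the example dependency-light, the names `masked_mean_pool` and `classify` are hypothetical, the single linear layer is an assumption (the README says only "fully connected"), and the README does not state whether the original pooling masks padded frames. The hidden size of 768 matches `facebook/wav2vec2-base`; all weights here are random placeholders.

```python
import numpy as np

HIDDEN_SIZE = 768  # hidden size of facebook/wav2vec2-base
NUM_CLASSES = 5    # 5-class task, per the README


def masked_mean_pool(hidden_states, lengths):
    """Average frame-level features over time, ignoring padded frames.

    hidden_states: array of shape (batch, time, dim)
    lengths: number of valid frames per utterance, shape (batch,)
    """
    _, time_steps, _ = hidden_states.shape
    # mask[b, t] is True for real frames and False for padding
    mask = np.arange(time_steps)[None, :] < np.asarray(lengths)[:, None]
    summed = (hidden_states * mask[:, :, None]).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1)  # avoid division by zero
    return summed / counts


def classify(embeddings, weight, bias):
    """Hypothetical linear head: predicted class index per utterance."""
    logits = embeddings @ weight.T + bias
    return logits.argmax(axis=1)


rng = np.random.default_rng(0)
frames = rng.normal(size=(2, 50, HIDDEN_SIZE))        # stand-in for encoder output
pooled = masked_mean_pool(frames, lengths=[50, 30])   # shape (2, 768)
weight = rng.normal(size=(NUM_CLASSES, HIDDEN_SIZE))  # placeholder head weights
bias = np.zeros(NUM_CLASSES)
predictions = classify(pooled, weight, bias)          # shape (2,), values in 0..4
```

Because the encoder is frozen, only `weight` and `bias` (or their equivalents in the real head) would change during fine-tuning under this reading of the setup.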
 
 
 
  ---

  ## Transfer Learning Setup

+ **Stage 1 Pretraining (MELD)**
+ - Audio-only emotion classification

+ **Stage 2 Fine-tuning (SEAC)**
+ - Encoder frozen
+ - Only classification head updated

  ---

  ## Evaluation (SEAC Test Set)

+ | Metric | Score |
+ |---------------|-------|
+ | Accuracy | **0.7107** |
+ | Weighted F1 | **0.7130** |
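
For reference, the two metrics in the table can be computed as in the sketch below. This is not the evaluation script behind the reported scores; the function names and the toy labels are illustrative assumptions. Weighted F1 averages the per-class F1 scores, weighting each class by its share of the true labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


# Toy labels, not SEAC data
y_true = ["neutral", "neutral", "joy", "anger"]
y_pred = ["neutral", "joy", "joy", "anger"]
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

On this toy example both values come out to 0.75, up to floating-point rounding. The two metrics generally differ on imbalanced data, which is why the table reports both.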
 
 
 
 
 
 

  ---

+ ## Notes

+ - Sampling rate: 16 kHz
+ - Mean temporal pooling is used to obtain utterance-level embeddings.
+ - The released weights include only the classification head.
+ The encoder is loaded from `facebook/wav2vec2-base`.

+ ---