tomasBernal committed on
Commit 8a103d2 · verified · 1 Parent(s): 39033af

Create README.md

  1. README.md +228 -0
README.md ADDED
@@ -0,0 +1,228 @@
+ ---
+ language:
+ - en
+ license: mit
+ library_name: transformers
+ pipeline_tag: audio-classification
+ tags:
+ - emotion-recognition
+ - speech-emotion-recognition
+ - multimodal-learning
+ - audio-classification
+ - speech-processing
+ - text-processing
+ - english
+ - affective-computing
+ - umuteam
+ datasets:
+ - RAVDESS
+ - TESS
+ - MELD
+ metrics:
+ - accuracy
+ - f1
+
+ model-index:
+ - name: UMUTeam/w2v-bert-beto-mean-emotion-en
+   results:
+   - task:
+       type: audio-classification
+       name: Multimodal Speech Emotion Recognition
+     dataset:
+       name: English Multimodal Emotion Recognition Benchmark
+       type: custom
+     metrics:
+     - type: accuracy
+       value: 90.2870
+       name: Accuracy
+     - type: weighted-f1
+       value: 90.2334
+       name: Weighted F1
+     - type: macro-f1
+       value: 90.2589
+       name: Macro F1
+ ---
+
+ # UMUTeam/w2v-bert-beto-mean-emotion-en
+
+ ## Model description
+
+ `UMUTeam/w2v-bert-beto-mean-emotion-en` is an English multimodal emotion recognition model developed as part of **speech-emotion**, an open-source multilingual and multimodal toolkit for emotion recognition from speech, text, or both.
+
+ This model performs **multimodal emotion classification from English speech and text inputs**.
+
+ The model combines acoustic representations extracted with Wav2Vec2-BERT and linguistic representations generated with RoBERTa using a mean fusion strategy.
+
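The mean fusion step described above can be illustrated with a minimal sketch. It is not the toolkit's implementation: it assumes that both encoders' pooled outputs have already been projected to the same dimension, and the names `mean_fusion`, `classify`, `weights`, and `biases` are hypothetical, chosen only for this example.

```python
LABELS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def mean_fusion(audio_vec, text_vec):
    """Element-wise mean of two embeddings of equal dimension."""
    if len(audio_vec) != len(text_vec):
        raise ValueError("project both embeddings to the same dimension first")
    return [(a + t) / 2.0 for a, t in zip(audio_vec, text_vec)]


def classify(fused, weights, biases):
    """Apply a linear layer (one weight row per label) and return the top label."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return LABELS[best]


# Toy 3-dimensional embeddings standing in for pooled encoder outputs.
audio_vec = [0.5, 1.0, -1.0]
text_vec = [1.5, 0.0, 1.0]
fused = mean_fusion(audio_vec, text_vec)

# Hypothetical classifier weights: only the "happy" row is non-zero.
weights = [[0.0, 0.0, 0.0] for _ in LABELS]
weights[LABELS.index("happy")] = [1.0, 1.0, 0.0]
biases = [0.0] * len(LABELS)

print(fused)
print(classify(fused, weights, biases))
```

Because the mean has no trainable parameters, both modalities contribute equally to the fused vector. That keeps the mechanism simple, but it cannot down-weight the weaker modality the way a learned fusion layer could.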
+ It is designed to exploit complementary emotional cues from speech and text, with the aim of improving on unimodal approaches.
+
+ The model predicts one of the following emotion labels:
+
+ - `angry`
+ - `disgust`
+ - `fear`
+ - `happy`
+ - `neutral`
+ - `sad`
+ - `surprise`
+
+ ## Intended use
+
+ This model is intended for research and applied scenarios involving multimodal emotion recognition in English, such as:
+
+ - multimodal conversational analysis
+ - speech and text emotion analysis
+ - affective computing research
+ - emotion-aware conversational systems
+ - human-computer interaction
+ - multimodal AI research
+
+ The model is particularly useful in scenarios where both speech audio and transcribed text are available.
+
+ It can be used through the `speech-emotion` toolkit.
+
+ ## Out-of-scope use
+
+ This model should not be used as the sole basis for high-stakes decisions, including but not limited to:
+
+ - clinical diagnosis
+ - mental health assessment
+ - employment, legal, or educational decisions
+ - biometric profiling or surveillance
+ - automated decisions affecting individuals without human oversight
+
+ Emotion recognition is inherently uncertain and context-dependent. Predictions should be interpreted as model estimates, not as definitive assessments of a person's emotional state.
+
+ ## Training data
+
+ The model was trained on the English multimodal datasets used in the `speech-emotion` project.
+
+ The training data combines multiple publicly available English speech and multimodal emotion recognition datasets, including:
+
+ - RAVDESS
+ - TESS
+ - MELD
+ - datasets derived from prior speech emotion recognition research benchmarks
+
+ Because the original datasets use different emotion taxonomies, all datasets were harmonized into a unified seven-class emotion taxonomy:
+
+ - `angry`
+ - `disgust`
+ - `fear`
+ - `happy`
+ - `neutral`
+ - `sad`
+ - `surprise`
+
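To make the harmonization step concrete, here is a small sketch of mapping dataset-specific labels onto the seven unified classes. The mapping table is purely illustrative and is not taken from the project's pipeline: the raw label spellings, the helper names `harmonize` and `harmonize_dataset`, and the choice to drop unmapped labels (such as `calm`) are all assumptions made for this example.

```python
UNIFIED = ("angry", "disgust", "fear", "happy", "neutral", "sad", "surprise")

# Hypothetical raw-label spellings; the real pipeline may use different ones.
RAW_TO_UNIFIED = {
    "anger": "angry",
    "angry": "angry",
    "disgust": "disgust",
    "fear": "fear",
    "fearful": "fear",
    "joy": "happy",
    "happy": "happy",
    "happiness": "happy",
    "neutral": "neutral",
    "sad": "sad",
    "sadness": "sad",
    "surprise": "surprise",
    "surprised": "surprise",
    "pleasant surprise": "surprise",
}


def harmonize(raw_label):
    """Return the unified label, or None when the raw label has no mapping."""
    key = raw_label.strip().lower().replace("_", " ")
    return RAW_TO_UNIFIED.get(key)


def harmonize_dataset(samples):
    """Map (sample_id, raw_label) pairs; unmapped samples are dropped (assumption)."""
    kept, dropped = [], 0
    for sample_id, raw_label in samples:
        label = harmonize(raw_label)
        if label is None:
            dropped += 1
        else:
            kept.append((sample_id, label))
    return kept, dropped


samples = [("a1", "Joy"), ("a2", "pleasant_surprise"), ("a3", "calm")]
kept, dropped = harmonize_dataset(samples)
print(kept, dropped)
```

Returning the number of dropped samples alongside the kept ones makes it easy to check how much data a given mapping discards before training.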
+ For the English multimodal emotion recognition setup, the same aligned speech-text samples were used for both the acoustic and textual modalities:
+
+ - Training samples: 3,622
+ - Validation samples: 453
+ - Test samples: 453
+
+ More details about the dataset preprocessing and label harmonization pipeline are available in the project repository:
+
+ https://github.com/NLP-UMUTeam/umuteam-speech-emotion
+
+ ## Evaluation
+
+ The model was evaluated on the English held-out test set used in the `speech-emotion` toolkit.
+
+ ### Performance comparison on English emotion recognition
+
+ | Configuration | Accuracy | Weighted Precision | Weighted F1 | Macro F1 |
+ |---|---:|---:|---:|---:|
+ | Speech-only | 95.1435 | 95.2700 | 95.1575 | 95.1679 |
+ | Text-only | 76.0842 | 75.5723 | 75.6852 | 68.0266 |
+ | Multimodal (Concat) | **96.0462** | **96.0880** | **96.0257** | **96.0462** |
+ | Multimodal (Mean) | 90.2870 | 90.5162 | 90.2334 | 90.2589 |
+ | Multimodal (Multihead) | 93.1567 | 93.2715 | 93.1898 | 93.2115 |
+
+ All values are percentages. Among the fusion strategies, only concatenation (96.05% accuracy) outperforms the speech-only baseline (95.14%). Mean fusion, the strategy used by this model, reaches 90.29% accuracy: well above the text-only system (76.08%), but below both the speech-only system and the other two fusion strategies.
+
+ The main advantage of mean fusion is a simpler fusion mechanism than more complex attention-based architectures, at the cost of lower accuracy in this comparison.
+
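For readers comparing the columns above, the following sketch shows how accuracy, macro F1, and weighted F1 relate to each other. It is a plain-Python illustration on toy labels, not the evaluation script used by the toolkit; the helper names `f1_per_class` and `summarize` are hypothetical, and the set of classes is taken from the reference labels only.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} for every label that occurs in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def summarize(y_true, y_pred):
    """Compute accuracy, macro F1 (unweighted mean) and weighted F1 (by support)."""
    per_class = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    macro_f1 = sum(per_class.values()) / len(per_class)
    weighted_f1 = sum(per_class[c] * support[c] for c in per_class) / n
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}


# Toy data: the minority class "sad" is always misclassified.
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "happy"]
metrics = summarize(y_true, y_pred)
print(metrics)
```

Macro F1 gives every class the same weight, so poor performance on a rare class lowers it more than it lowers weighted F1. A gap of that kind is visible in the text-only row of the table (68.03 macro F1 versus 75.69 weighted F1).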
+ ## How to use
+
+ Install the toolkit:
+
+ ```bash
+ pip install speech-emotion
+ ```
+
+ ### Multimodal emotion recognition using audio and text
+
+ ```python
+ from speech_emotion import predict_emotion
+
+ # The text should be the transcription of the same utterance as the audio.
+ emotion = predict_emotion(
+     audio_path="audio.wav",
+     text="I was really happy to see you again.",
+     language="en",
+     mode="mean",  # mean fusion, the strategy used by this model
+     model_config_path="model.json"
+ )
+
+ print("Detected emotion:", emotion)
+ ```
+
+ ### Multimodal emotion recognition using automatic transcription (Whisper)
+
+ If no transcription is provided, the toolkit can automatically generate it using Whisper before performing emotion recognition.
+
+ ```python
+ from speech_emotion import predict_emotion
+
+ # No `text` argument: the transcription is generated with Whisper.
+ emotion = predict_emotion(
+     audio_path="audio.wav",
+     language="en",
+     mode="mean",
+     model_config_path="model.json"
+ )
+
+ print("Detected emotion:", emotion)
+ ```
+
+ Repository:
+
+ https://github.com/NLP-UMUTeam/umuteam-speech-emotion
+
+ ## Limitations
+
+ - The model is designed for English multimodal emotion recognition and may not generalize reliably to other languages.
+ - It predicts a single label from a fixed set of seven emotions.
+ - Emotion expression is subjective and highly context-dependent.
+ - Performance may decrease with noisy audio, overlapping speakers, or domain shifts.
+ - The model assumes that audio and text inputs are semantically aligned.
+ - Inaccurate transcriptions, including errors in automatic speech transcription, may negatively affect multimodal performance.
+
+ ## Bias and ethical considerations
+
+ Emotion recognition systems may reflect biases present in their training data, including differences related to accents, speaking styles, demographics, recording conditions, or annotation subjectivity.
+
+ Users should avoid interpreting predictions as objective truths about a person's internal emotional state. The model should be used with transparency, appropriate consent, and human oversight, especially in sensitive contexts.
+
+ ## Citation
+
+ If you use this model in your research, please cite the following work:
+
+ ### speech-emotion toolkit
+
+ ```bibtex
+ @article{PAN2026102677,
+   title = {speech-emotion: A multilingual and multimodal toolkit for emotion recognition from speech},
+   journal = {SoftwareX},
+   volume = {34},
+   pages = {102677},
+   year = {2026},
+   issn = {2352-7110},
+   doi = {10.1016/j.softx.2026.102677},
+   url = {https://www.sciencedirect.com/science/article/pii/S235271102600169X},
+   author = {Ronghao Pan and Tomás Bernal-Beltrán and José Antonio García-Díaz and Rafael Valencia-García},
+ }
+ ```
+
+ ## Acknowledgments
+
+ This work is part of the research project LaTe4PoliticES (PID2022-138099OB-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF/EU - FEDER/UE), “A way of making Europe”.
+
+ Mr. Tomás Bernal-Beltrán is supported by the University of Murcia through the predoctoral programme.