Files changed (1)
  1. README.md +231 -0
README.md ADDED
@@ -0,0 +1,231 @@
+ ---
+ language:
+   - en
+ license: mit
+ library_name: transformers
+ pipeline_tag: audio-classification
+ tags:
+   - emotion-recognition
+   - speech-emotion-recognition
+   - multimodal-learning
+   - audio-classification
+   - speech-processing
+   - text-processing
+   - english
+   - affective-computing
+   - umuteam
+ datasets:
+   - RAVDESS
+   - TESS
+   - MELD
+ metrics:
+   - accuracy
+   - f1
+
+ model-index:
+   - name: UMUTeam/w2v-bert-beto-multihead-emotion-en
+     results:
+       - task:
+           type: audio-classification
+           name: Multimodal Speech Emotion Recognition
+         dataset:
+           name: English Multimodal Emotion Recognition Benchmark
+           type: custom
+         metrics:
+           - type: accuracy
+             value: 93.1567
+             name: Accuracy
+           - type: weighted-f1
+             value: 93.1898
+             name: Weighted F1
+           - type: macro-f1
+             value: 93.2115
+             name: Macro F1
+ ---
+
+ # UMUTeam/w2v-bert-beto-multihead-emotion-en
+
+ ## Model description
+
+ `UMUTeam/w2v-bert-beto-multihead-emotion-en` is an English multimodal emotion recognition model developed as part of **speech-emotion**, an open-source multilingual and multimodal toolkit for emotion recognition from speech, text, and multimodal inputs.
+
+ This model performs **multimodal emotion classification from English speech and text inputs**.
+
+ The model combines acoustic representations extracted with Wav2Vec2-BERT and linguistic representations generated with RoBERTa using a multi-head cross-attention fusion strategy.
+
+ The fusion mechanism uses cross-attention layers to model interactions between acoustic and textual representations, enabling the system to learn relationships between speech and linguistic emotional cues.
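As a rough illustration of the idea, the following pure-Python sketch lets text token vectors attend over audio frame vectors using scaled dot-product attention split across heads. It is a toy under stated assumptions: the names `softmax` and `cross_attention`, the identity projections (no learned weights), and the tiny dimensions are hypothetical and are not taken from the released model's layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values, num_heads):
    """Toy multi-head cross-attention with identity projections (assumption).

    queries:      one vector per text token (list of lists of floats)
    keys_values:  one vector per audio frame, same dimensionality
    Returns one fused vector per query, same dimensionality.
    """
    dim = len(queries[0])
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    head_dim = dim // num_heads
    fused = []
    for q in queries:
        out = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            q_h = q[lo:hi]
            # Scaled dot product between this query head and every audio frame.
            scores = [
                sum(a * b for a, b in zip(q_h, kv[lo:hi])) / math.sqrt(head_dim)
                for kv in keys_values
            ]
            weights = softmax(scores)
            # Attention-weighted average of the audio frames for this head.
            for d in range(lo, hi):
                out.append(sum(w * kv[d] for w, kv in zip(weights, keys_values)))
        fused.append(out)
    return fused

# Made-up inputs: 2 text tokens attending over 3 audio frames, 4 dims, 2 heads.
text_tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
audio_frames = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
fused = cross_attention(text_tokens, audio_frames, num_heads=2)
```

Each output vector is a per-head weighted average of the audio frames, so the text side ends up carrying acoustic information selected by its own content. A real fusion layer would add learned query, key, and value projections and typically a classification head on top.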
+
+ It is designed to jointly exploit complementary emotional cues from speech and text, with the aim of improving on unimodal approaches.
+
+ The model predicts one of the following emotion labels:
+
+ - `angry`
+ - `disgust`
+ - `fear`
+ - `happy`
+ - `neutral`
+ - `sad`
+ - `surprise`
+
+ ## Intended use
+
+ This model is intended for research and applied scenarios involving multimodal emotion recognition in English, such as:
+
+ - multimodal conversational analysis
+ - speech and text emotion analysis
+ - affective computing research
+ - emotion-aware conversational systems
+ - human-computer interaction
+ - multimodal AI research
+
+ The model is particularly useful in scenarios where both speech audio and transcribed text are available.
+
+ It can be used through the `speech-emotion` toolkit.
+
+ ## Out-of-scope use
+
+ This model should not be used as the sole basis for high-stakes decisions, including but not limited to:
+
+ - clinical diagnosis
+ - mental health assessment
+ - employment, legal, or educational decisions
+ - biometric profiling or surveillance
+ - automated decisions affecting individuals without human oversight
+
+ Emotion recognition is inherently uncertain and context-dependent. Predictions should be interpreted as model estimates, not as definitive assessments of a person's emotional state.
+
+ ## Training data
+
+ The model was trained on the English multimodal datasets used in the `speech-emotion` project.
+
+ The training data combines multiple publicly available English speech and multimodal emotion recognition datasets, including:
+
+ - RAVDESS
+ - TESS
+ - MELD
+ - datasets derived from prior speech emotion recognition research benchmarks
+
+ Because the original datasets use different emotion taxonomies, all datasets were harmonized into a unified seven-class emotion taxonomy:
+
+ - `angry`
+ - `disgust`
+ - `fear`
+ - `happy`
+ - `neutral`
+ - `sad`
+ - `surprise`
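The harmonization step can be pictured as a per-dataset lookup table. The sketch below is hypothetical: the source label spellings are assumed from the datasets' commonly published label sets, and folding RAVDESS `calm` into `neutral` is an illustrative choice, not the project's documented rule. The actual mapping lives in the project repository.

```python
UNIFIED_LABELS = ("angry", "disgust", "fear", "happy", "neutral", "sad", "surprise")

# Hypothetical source-label -> unified-label tables (assumed for illustration).
LABEL_MAPS = {
    "RAVDESS": {
        "angry": "angry", "disgust": "disgust", "fearful": "fear",
        "happy": "happy", "neutral": "neutral",
        "calm": "neutral",  # assumption: the project may treat this differently
        "sad": "sad", "surprised": "surprise",
    },
    "MELD": {
        "anger": "angry", "disgust": "disgust", "fear": "fear",
        "joy": "happy", "neutral": "neutral", "sadness": "sad",
        "surprise": "surprise",
    },
}

def harmonize(dataset, label):
    """Map a dataset-specific label to the unified taxonomy, or raise if unknown."""
    try:
        unified = LABEL_MAPS[dataset][label.strip().lower()]
    except KeyError:
        raise ValueError(f"no mapping for {dataset!r} label {label!r}") from None
    assert unified in UNIFIED_LABELS
    return unified

print(harmonize("MELD", "joy"))         # happy
print(harmonize("RAVDESS", "Fearful"))  # fear
```

Raising on unknown labels, rather than silently dropping them, makes it obvious when a new dataset introduces a category the unified taxonomy does not cover.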
+
+ For the English multimodal emotion recognition setup, the same aligned speech-text samples were used for both the acoustic and textual modalities:
+
+ - Training samples: 3,622
+ - Validation samples: 453
+ - Test samples: 453
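A quick arithmetic check of these counts shows that they add up to 4,528 samples in a split of roughly 80/10/10. The variable names below are illustrative only.

```python
splits = {"train": 3622, "validation": 453, "test": 453}

total = sum(splits.values())
shares = {name: round(100 * count / total, 2) for name, count in splits.items()}

print(total)   # 4528
print(shares)  # {'train': 79.99, 'validation': 10.0, 'test': 10.0}
```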
+
+ More details about the dataset preprocessing and label harmonization pipeline are available in the project repository:
+
+ https://github.com/NLP-UMUTeam/umuteam-speech-emotion
+
+ ## Evaluation
+
+ The model was evaluated on the English held-out test set used in the `speech-emotion` toolkit. All scores are percentages; the best value in each column is in bold.
+
+ ### Performance comparison on English emotion recognition
+
+ | Configuration | Accuracy | Weighted Precision | Weighted F1 | Macro F1 |
+ |---|---:|---:|---:|---:|
+ | Speech-only | 95.1435 | 95.2700 | 95.1575 | 95.1679 |
+ | Text-only | 76.0842 | 75.5723 | 75.6852 | 68.0266 |
+ | Multimodal (Concat) | **96.0462** | **96.0880** | **96.0257** | **96.0462** |
+ | Multimodal (Mean) | 90.2870 | 90.5162 | 90.2334 | 90.2589 |
+ | Multimodal (Multihead) | 93.1567 | 93.2715 | 93.1898 | 93.2115 |
+
+ All three multimodal configurations clearly outperform the text-only baseline, but only concatenation-based fusion (96.05% accuracy) exceeds the speech-only baseline (95.14%).
+
+ The multi-head cross-attention strategy, which lets the two modalities interact through attention, reaches 93.16% accuracy and 93.21 macro F1. This places it above mean fusion (90.29% accuracy) but below both concatenation and speech-only.
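The three score types in the table can diverge when classes are imbalanced, as in the text-only row (weighted F1 75.69 versus macro F1 68.03). The sketch below computes them from scratch to show why; the function name `f1_report` and the toy labels are made up for illustration and are not the toolkit's evaluation code.

```python
from collections import Counter

def f1_report(y_true, y_pred):
    """Return accuracy, macro F1 and support-weighted F1 as fractions in [0, 1]."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class[label] = 2 * tp / denom if denom else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Macro: every class counts equally. Weighted: classes count by support.
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return {"accuracy": accuracy, "macro_f1": macro, "weighted_f1": weighted}

# Made-up, imbalanced toy labels: the rare class is always missed.
y_true = ["happy"] * 8 + ["fear"] * 2
y_pred = ["happy"] * 10
report = f1_report(y_true, y_pred)
print(report)
```

In this toy case the missed minority class pulls macro F1 well below weighted F1, which is the same pattern the text-only row shows.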
+
+ ## How to use
+
+ Install the toolkit:
+
+ ```bash
+ pip install speech-emotion
+ ```
+
+ ### Multimodal emotion recognition using audio and text
+
+ ```python
+ from speech_emotion import predict_emotion
+
+ # Pass both the audio file and its transcription.
+ emotion = predict_emotion(
+     audio_path="audio.wav",
+     text="I was really happy to see you again.",
+     language="en",
+     mode="multihead",
+     model_config_path="model.json"
+ )
+
+ print("Detected emotion:", emotion)
+ ```
+
+ ### Multimodal emotion recognition using automatic transcription (Whisper)
+
+ If no transcription is provided, the toolkit can automatically generate it using Whisper before performing emotion recognition.
+
+ ```python
+ from speech_emotion import predict_emotion
+
+ # No `text` argument: the toolkit transcribes the audio with Whisper first.
+ emotion = predict_emotion(
+     audio_path="audio.wav",
+     language="en",
+     mode="multihead",
+     model_config_path="model.json"
+ )
+
+ print("Detected emotion:", emotion)
+ ```
+
+ Repository:
+
+ https://github.com/NLP-UMUTeam/umuteam-speech-emotion
+
+ ## Limitations
+
+ - The model is designed for English multimodal emotion recognition and may not generalize reliably to other languages.
+ - It predicts a single label from a fixed set of seven emotions.
+ - Emotion expression is subjective and highly context-dependent.
+ - Performance may decrease with noisy audio, inaccurate transcriptions, overlapping speakers, or domain shifts.
+ - The model assumes that audio and text inputs are semantically aligned.
+ - Errors in automatic speech transcription may negatively affect multimodal performance.
+ - Attention-based fusion mechanisms may require larger datasets or different optimization strategies to generalize effectively.
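Several of these limitations concern input quality, so a small pre-flight check on the audio file can catch obvious mismatches before inference. The sketch below uses only Python's standard `wave` module. The helper name `describe_wav` and the 16 kHz mono target are assumptions (16 kHz is a common input rate for Wav2Vec2-style encoders), not documented requirements of the toolkit.

```python
import wave

def describe_wav(path, expected_rate=16000):
    """Read basic properties of a PCM WAV file and flag likely mismatches."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    warnings = []
    if rate != expected_rate:
        warnings.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != 1:
        warnings.append(f"{channels} channels found, mono expected")
    if frames == 0:
        warnings.append("file contains no audio frames")
    return {
        "rate": rate,
        "channels": channels,
        "seconds": frames / rate if rate else 0.0,
        "warnings": warnings,
    }
```

A caller could run `describe_wav("audio.wav")` and inspect the `warnings` list before calling the model. The check says nothing about noise, overlapping speakers, or whether the transcript matches the audio; those still need human judgment.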
+
+ ## Bias and ethical considerations
+
+ Emotion recognition systems may reflect biases present in their training data, including differences related to accents, speaking styles, demographics, recording conditions, or annotation subjectivity.
+
+ Users should avoid interpreting predictions as objective truths about a person's internal emotional state. The model should be used with transparency, appropriate consent, and human oversight, especially in sensitive contexts.
+
+ ## Citation
+
+ If you use this model in your research, please cite the following work:
+
+ ### speech-emotion toolkit
+
+ ```bibtex
+ @article{PAN2026102677,
+   title = {speech-emotion: A multilingual and multimodal toolkit for emotion recognition from speech},
+   journal = {SoftwareX},
+   volume = {34},
+   pages = {102677},
+   year = {2026},
+   issn = {2352-7110},
+   doi = {10.1016/j.softx.2026.102677},
+   url = {https://www.sciencedirect.com/science/article/pii/S235271102600169X},
+   author = {Ronghao Pan and Tomás Bernal-Beltrán and José Antonio García-Díaz and Rafael Valencia-García},
+ }
+ ```
+
+ ## Acknowledgments
+
+ This work is part of the research project LaTe4PoliticES (PID2022-138099OB-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF/EU - FEDER/UE), “A way of making Europe”.
+
+ Mr. Tomás Bernal-Beltrán is supported by the University of Murcia through the predoctoral programme.