---
language: en
license: mit
library_name: pytorch
task_categories:
- multimodal-classification
tags:
- emotion-recognition
- multimodal
- early-fusion
- text
- audio
- meld
---

# Early Fusion Emotion Recognition on MELD (Text + Audio)

This repository contains an **early fusion multimodal emotion recognition model** trained on the **MELD dataset**, combining **textual and acoustic embeddings** at the feature level.

Both modalities are encoded independently and fused via **embedding concatenation** before classification.

---

## Model Overview

- **Text encoder:** `bert-base-uncased` (frozen)
- **Audio encoder:** pre-extracted acoustic features (frozen)
- **Fusion strategy:** Early fusion (concatenation)
- **Classifier:** 2-layer MLP
- **Training strategy:**
  - Both encoders are **frozen**
  - Only the fusion classifier is trained

---

## Dataset

- **Name:** MELD (declare-lab/MELD)
- **Modalities:** Text + Audio
- **Setting:** Multi-class emotion classification
- **Splits:** Train / Validation / Test (official MELD splits)

---

## Training Details

- **Loss:** Cross-entropy
- **Optimizer:** Adam
- **Fusion dimension:** 1536
- **Evaluation metrics:**
  - Accuracy
  - Macro F1-score
  - Per-class F1-score

---

## Important Notes

- This model does **not perform end-to-end multimodal fine-tuning**.
- Both text and audio encoders act as **frozen feature extractors**.
- The provided weights correspond **only to the fusion classifier**.

To reproduce results, identical feature extraction pipelines must be used for both modalities.

---

## Intended Use

- Multimodal emotion recognition research
- Comparison with unimodal baselines
- Early vs. late fusion analysis
- Educational and academic purposes

---

## Limitations

- Temporal context across utterances is not modeled
- Speaker identity is not used
- Fusion is limited to simple feature concatenation
- Performance depends on quality of pre-extracted features
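
---

## Example: Fusion Classifier Sketch

The early fusion setup described above (frozen encoders, concatenated embeddings, a 2-layer MLP trained with Adam and cross-entropy) can be illustrated with a minimal PyTorch sketch. This is not the repository's actual training code. Several values are assumptions: the audio embedding size of 768 is inferred from the stated fusion dimension (1536) minus the `bert-base-uncased` hidden size (768); the class count of 7 is the number of emotion labels in MELD, which this card does not state explicitly; the hidden width, learning rate, and the name `EarlyFusionClassifier` are hypothetical. Random tensors stand in for the pre-extracted embeddings.

```python
import torch
import torch.nn as nn

TEXT_DIM = 768     # hidden size of bert-base-uncased
AUDIO_DIM = 768    # assumption: 1536 (fusion dimension) - 768 (text dimension)
NUM_CLASSES = 7    # assumption: the seven MELD emotion labels
HIDDEN_DIM = 512   # hypothetical; not stated in this card


class EarlyFusionClassifier(nn.Module):
    """2-layer MLP over concatenated text and audio embeddings (illustrative)."""

    def __init__(self, text_dim=TEXT_DIM, audio_dim=AUDIO_DIM,
                 hidden_dim=HIDDEN_DIM, num_classes=NUM_CLASSES):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_emb, audio_emb):
        # Early fusion: concatenate along the feature dimension.
        fused = torch.cat([text_emb, audio_emb], dim=-1)
        return self.mlp(fused)


torch.manual_seed(0)
model = EarlyFusionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is hypothetical
criterion = nn.CrossEntropyLoss()

# Random stand-ins for embeddings produced by the frozen encoders.
text_emb = torch.randn(4, TEXT_DIM)
audio_emb = torch.randn(4, AUDIO_DIM)
labels = torch.randint(0, NUM_CLASSES, (4,))

# One training step: only the classifier's parameters are updated.
logits = model(text_emb, audio_emb)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the encoders are frozen, the embeddings can be computed once and cached, and only the small MLP needs gradients. In real use, the random tensors would be replaced by features from the same extraction pipelines used to train the released weights.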