language: en
license: mit
library_name: pytorch
task_categories:
  - multimodal-classification
tags:
  - emotion-recognition
  - multimodal
  - early-fusion
  - text
  - audio
  - meld

Early Fusion Emotion Recognition on MELD (Text + Audio)

This repository contains an early-fusion multimodal emotion recognition model trained on the MELD dataset. It combines textual and acoustic embeddings at the feature level.

Both modalities are encoded independently and fused via embedding concatenation before classification.


Model Overview

  • Text encoder: bert-base-uncased (frozen)
  • Audio encoder: pre-extracted acoustic features (frozen)
  • Fusion strategy: Early fusion (concatenation)
  • Classifier: 2-layer MLP
  • Training strategy:
    • Both encoders are frozen
    • Only the fusion classifier is trained
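
The architecture above can be sketched roughly as follows. This is a minimal illustration, not the released code: the class name `EarlyFusionClassifier`, the hidden size, the dropout rate, and the 768-dimensional audio embedding are assumptions (768 is the hidden size of bert-base-uncased, and an equal audio size would give a 1536-dimensional fused vector). The seven output classes assume the standard MELD emotion label set.

```python
import torch
from torch import nn

TEXT_DIM = 768    # hidden size of bert-base-uncased
AUDIO_DIM = 768   # assumption: chosen so TEXT_DIM + AUDIO_DIM == 1536
NUM_CLASSES = 7   # assumption: the seven MELD emotion labels
HIDDEN_DIM = 256  # assumption: not stated in this card


class EarlyFusionClassifier(nn.Module):
    """Concatenate text and audio embeddings, then classify with a 2-layer MLP."""

    def __init__(self, text_dim=TEXT_DIM, audio_dim=AUDIO_DIM,
                 hidden_dim=HIDDEN_DIM, num_classes=NUM_CLASSES, dropout=0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_emb, audio_emb):
        # Early fusion: join the two modality vectors along the feature axis.
        fused = torch.cat([text_emb, audio_emb], dim=-1)
        return self.mlp(fused)  # unnormalized class logits


# Random tensors stand in for the frozen encoders' outputs.
model = EarlyFusionClassifier()
text_emb = torch.randn(4, TEXT_DIM)
audio_emb = torch.randn(4, AUDIO_DIM)
logits = model(text_emb, audio_emb)
```

Because both encoders are frozen, their embeddings can be computed once and cached, so only the small MLP needs gradients during training.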


Dataset

  • Name: MELD (declare-lab/MELD)
  • Modalities: Text + Audio
  • Setting: Multi-class emotion classification
  • Splits: Train / Validation / Test (official MELD splits)

Training Details

  • Loss: Cross-entropy
  • Optimizer: Adam
  • Fusion dimension: 1536
  • Evaluation metrics:
    • Accuracy
    • Macro F1-score
    • Per-class F1-score
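
For reference, the three evaluation metrics can be computed from integer label lists as in the sketch below. The function names are illustrative rather than taken from the training code, and scoring a class with no true or predicted examples as 0.0 is one common convention, assumed here.

```python
def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_f1(y_true, y_pred, num_classes):
    """F1 for each class, computed one-vs-rest as 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class that never appears scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, num_classes)
    return sum(scores) / num_classes


# Toy example with three classes (not MELD results).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1s = per_class_f1(y_true, y_pred, num_classes=3)
macro = macro_f1(y_true, y_pred, num_classes=3)
```

Macro F1 weights every class equally regardless of how many examples it has, which is why it is usually reported alongside accuracy for emotion classification.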

Important Notes

  • This model does not perform end-to-end multimodal fine-tuning.
  • Both text and audio encoders act as frozen feature extractors.
  • The provided weights correspond only to the fusion classifier.

To reproduce results, identical feature extraction pipelines must be used for both modalities.
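
Since the checkpoint holds only the fusion classifier, loading it amounts to rebuilding the same MLP and restoring its state dict, as in the round-trip sketch below. The file name `fusion_classifier.pt`, the helper `build_classifier`, and the layer sizes other than the 1536-dimensional input are hypothetical; the actual checkpoint name and layout should be checked against the repository files.

```python
import os
import tempfile

import torch
from torch import nn


def build_classifier():
    # Hypothetical stand-in for the fusion classifier; hidden size and
    # class count are assumptions, only the 1536 input comes from the card.
    return nn.Sequential(nn.Linear(1536, 256), nn.ReLU(), nn.Linear(256, 7))


trained = build_classifier()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fusion_classifier.pt")  # hypothetical file name
    # Save only the classifier parameters, not the encoders.
    torch.save(trained.state_dict(), path)

    # Rebuild the same architecture and restore the saved parameters.
    restored = build_classifier()
    state = torch.load(path, map_location="cpu")
    restored.load_state_dict(state)

trained.eval()
restored.eval()

# Fused features must come from the same extraction pipeline used in training;
# random values are used here only to check the round trip.
features = torch.randn(2, 1536)
with torch.no_grad():
    same = torch.allclose(trained(features), restored(features))
```

`load_state_dict` raises an error on missing or unexpected keys by default, which gives an early signal if the rebuilt architecture does not match the checkpoint.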


Intended Use

  • Multimodal emotion recognition research
  • Comparison with unimodal baselines
  • Early vs. late fusion analysis
  • Educational and academic purposes

Limitations

  • Temporal context across utterances is not modeled
  • Speaker identity is not used
  • Fusion is limited to simple feature concatenation
  • Performance depends on the quality of the pre-extracted features