---
language: en
tags:
- emotion-recognition
- audio-visual
- multimodal
- meld
license: apache-2.0
---

# AVERFormer-v4 (MELD)

Multimodal audio-visual-text emotion recognition transformer trained on MELD.

Code: https://github.com/mhussainahmad/AVERFormer

## Reported numbers

- Best single-seed validation wF1: **0.3815709082075283** (seed 1337, epoch 4)
- Ensemble wF1: **0.6150**
- 10-seed average wF1: **0.6128**

## Classes (7)

`['neutral', 'joy', 'sadness', 'anger', 'fear', 'disgust', 'surprise']`

## Architecture

- Audio: `microsoft/wavlm-large` (16 kHz mono waveform)
- Video: `MCG-NJU/videomae-large` (16 frames @ 224x224 RGB)
- Text: `microsoft/deberta-v3-large` (speaker-aware context encoder)
- Fusion: 2-layer cross-modal transformer, dim=512, 8 heads
- Heads: face / voice / text / joint (all four predict the same set of classes)

## Loading

```python
import json

import torch
from huggingface_hub import hf_hub_download

# AVERFormerV4 is project code, not a pip package: it must be importable,
# presumably from a checkout of the GitHub repo linked above.
from models.averformer_v4 import AVERFormerV4

REPO_ID = "mhussainahmad/averformer-meld-v4"

# Download the config and checkpoint from the Hugging Face Hub.
with open(hf_hub_download(repo_id=REPO_ID, filename="config.json")) as f:
    cfg = json.load(f)
ckpt = hf_hub_download(repo_id=REPO_ID, filename="pytorch_model.pth")

model = AVERFormerV4(
    audio_backbone=cfg["audio_backbone"],
    video_backbone=cfg["video_backbone"],
    text_backbone=cfg["text_backbone"],
    num_classes=cfg["num_classes"],
    fusion_layers=cfg["fusion_layers"],
    lora_r=cfg["lora_r"],
    use_text=True,
)

# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
state = torch.load(ckpt, map_location="cpu", weights_only=False)
# strict=False silently ignores missing or unexpected keys in the state dict.
model.load_state_dict(state["model"], strict=False)
model.eval()
```

## Live inference

```bash
python live_emotion_v4.py --repo_id mhussainahmad/averformer-meld-v4
```

See `LIVE_INFERENCE_README.md` in the GitHub repo for full setup.
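
## Computing wF1 (sketch)

The wF1 figures under "Reported numbers" are read here as support-weighted F1 over the seven classes, which is the usual MELD metric. The card does not spell out the definition, so treat that reading as an assumption. The dependency-free sketch below shows the metric under that assumption; `weighted_f1` and `CLASSES` are illustrative names and are not taken from the AVERFormer codebase.

```python
from collections import Counter

# Class names as listed in this card.
CLASSES = ["neutral", "joy", "sadness", "anger", "fear", "disgust", "surprise"]


def weighted_f1(y_true, y_pred, labels=CLASSES):
    """Mean of per-class F1 scores, weighted by each class's true count."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    support = Counter(y_true)    # TP + FN per class
    predicted = Counter(y_pred)  # TP + FP per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # TP per class

    total = 0.0
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (support + predicted)
        denom = support[label] + predicted[label]
        f1 = 2 * correct[label] / denom if denom else 0.0
        total += support[label] * f1
    return total / len(y_true)


# Toy example with made-up labels, not MELD data.
y_true = ["neutral", "neutral", "joy", "anger"]
y_pred = ["neutral", "joy", "joy", "anger"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

With scikit-learn installed, `sklearn.metrics.f1_score(y_true, y_pred, average="weighted")` should give the same value for these inputs.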