Challenge: Multimodal Emotion Recognition in Conversations (MERC)

Problem statement
Build a model that predicts the emotion of each utterance in a multi‑party dialogue from the Friends TV series. For every utterance, you are given the transcript text, metadata (speaker, dialogue_id, utterance_id, season/episode, timestamps), and a trimmed video segment (with both audio and visual streams). Predict one of seven emotions for every utterance:
- anger, disgust, sadness, joy, neutral, surprise, fear

Competition intent
This is a multimodal, context‑rich classification task. High‑quality solutions typically combine text with prosody and facial cues, and many approaches benefit from modeling context across utterances using the dialogue_id and utterance_id.

Files provided to participants
- train.csv: Labeled training data. Columns:
  - id: Unique identifier for an utterance (maps to a media file name; see video column)
  - utterance: Text transcript
  - speaker: Speaker name
  - dialogue_id: Integer dialogue identifier
  - utterance_id: Integer position of the utterance within its dialogue
  - season, episode: TV metadata
  - start_time, end_time: Episode timecodes for the utterance
  - video: Relative path to the utterance‑level MP4 clip (contains both audio and visual streams)
  - emotion: Target label in {anger, disgust, sadness, joy, neutral, surprise, fear}

- test.csv: Same columns as train.csv except the emotion column is omitted. The id values correspond 1‑to‑1 with the provided media clips.

- sample_submission.csv: Example of the expected submission format with random but valid labels.

- media/train/*.mp4 and media/test/*.mp4: Utterance‑level clips for the train and test splits. Filenames are anonymized; join clips to rows only via the id and video columns, since the original names could leak labels.
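Since context models need utterances grouped by dialogue and ordered by position, a minimal stdlib-only sketch of that grouping is shown below. It uses a tiny in-memory stand-in for train.csv with the columns listed above; the example values (timecodes, paths, utterances) are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory sample mimicking train.csv's columns; real code
# would open the actual file instead of this StringIO.
sample = io.StringIO(
    "id,utterance,speaker,dialogue_id,utterance_id,season,episode,"
    "start_time,end_time,video,emotion\n"
    "u1,Hi!,Joey,0,0,1,1,0.0,1.2,media/train/u1.mp4,joy\n"
    "u2,Oh no.,Ross,0,1,1,1,1.2,2.5,media/train/u2.mp4,sadness\n"
    "u3,What?,Monica,1,0,1,2,0.0,0.8,media/train/u3.mp4,surprise\n"
)

# Group rows by dialogue_id, then sort each dialogue by utterance_id
# so a context model can consume utterances in conversational order.
dialogues = defaultdict(list)
for row in csv.DictReader(sample):
    dialogues[int(row["dialogue_id"])].append(row)
for rows in dialogues.values():
    rows.sort(key=lambda r: int(r["utterance_id"]))
```

The same grouping works identically on test.csv (minus the emotion column).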

Task
Train a model to predict the emotion for each row in test.csv using any subset of the provided modalities (text, audio, visual) and/or dialogue context. External pretraining is allowed, but labels beyond train.csv are not.
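Before building a multimodal model, it is worth establishing a trivial floor. The sketch below is a hypothetical majority-class baseline, not a recommended solution: it predicts the most frequent training emotion for every test row.

```python
from collections import Counter


def majority_baseline(train_labels, n_test):
    """Predict the most frequent training emotion for every test utterance."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_test


# Toy example with made-up labels; on the real data the majority class
# would likely be "neutral".
preds = majority_baseline(["joy", "neutral", "neutral", "fear"], n_test=3)
```

Note that under macro F1 this baseline scores poorly, since six of the seven per-class F1 scores are zero.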

Evaluation
Submissions are evaluated using macro‑averaged F1 across the seven classes. For each class c, we compute F1(c) from precision and recall, then report the unweighted mean across classes. Macro F1 weights all classes equally, so rare emotions count as much as frequent ones and the majority class cannot dominate the score.
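The metric above can be sketched in plain Python (the mean is always taken over all seven classes, so a class absent from your predictions contributes an F1 of zero):

```python
EMOTIONS = ["anger", "disgust", "sadness", "joy", "neutral", "surprise", "fear"]


def macro_f1(y_true, y_pred, labels=EMOTIONS):
    """Unweighted mean of per-class F1 over the fixed label set."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(labels)
```

This matches `sklearn.metrics.f1_score(y_true, y_pred, labels=EMOTIONS, average="macro")` on the same inputs.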

Submission format
- File name: any name with a .csv extension
- Columns: exactly two with header id,emotion
- id values must match test.csv exactly (no extra rows, no missing rows, no duplicates)
- emotion must be one of: anger, disgust, sadness, joy, neutral, surprise, fear
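A sanity check against the three rules above is cheap to run before submitting. The helper below is a sketch: it takes already-parsed (id, emotion) pairs and the expected test ids, and returns a list of violations (empty means the file passes these checks).

```python
VALID_EMOTIONS = {"anger", "disgust", "sadness", "joy", "neutral", "surprise", "fear"}


def validate_submission(rows, test_ids):
    """rows: list of (id, emotion) pairs from the CSV body (header excluded).
    test_ids: the id column of test.csv. Returns a list of error strings."""
    errors = []
    ids = [r[0] for r in rows]
    if len(ids) != len(set(ids)):
        errors.append("duplicate ids")
    if set(ids) != set(test_ids):
        errors.append("ids do not match test.csv")
    bad = {emotion for _, emotion in rows} - VALID_EMOTIONS
    if bad:
        errors.append("invalid labels: " + ", ".join(sorted(bad)))
    return errors
```

Reading the two files with `csv.reader` and passing the parsed rows in is left to the caller.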

Data split and integrity
- Training data merges the original training and development annotations, after removing exact duplicates and dropping the small number of entries lacking a matching media file.
- Test data is held‑out and disjoint at the utterance level. Speakers overlap with the training set, so expect realistic (but not adversarial) distribution shift.
- Media filenames are anonymized to prevent leakage from original names.


Good luck and have fun building robust multimodal emotion recognizers!
