Competition: Who Spoke It? Robust Speaker Recognition from 1‑Minute Audio Clips

Problem statement
Build a model that identifies the speaker of a 1‑minute mono 16 kHz WAV clip from a pool of roughly 50 speakers. You will be given labeled training clips and an unlabeled test set drawn from the same speaker pool. Your task is to predict the correct speaker label for each test clip.

Why this is challenging
- Long-form speech is split into 1‑minute segments. Adjacent segments can be highly similar, so you must generalize beyond local memorization.
- Many speakers and thousands of clips require careful feature engineering (e.g., MFCCs, log‑mel spectrograms, x‑vectors/ECAPA‑TDNN embeddings) and robust modeling.
- Acoustic variability (backgrounds, dynamics, speaking styles) and variable content demand strong preprocessing and regularization.
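As a concrete starting point, here is a stdlib-only sketch of loading a clip and computing a toy per-frame log-energy contour. The helper names (`load_wav`, `frame_log_energy`) are illustrative, not part of the competition code; a real pipeline would compute log-mel or MFCC features with a library such as librosa or torchaudio.

```python
import array
import math
import wave

def load_wav(path):
    """Read a 16-bit mono WAV file; return (samples in [-1, 1], sample rate)."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2
        raw = array.array("h", w.readframes(w.getnframes()))
        return [s / 32768.0 for s in raw], w.getframerate()

def frame_log_energy(samples, rate, frame_ms=25, hop_ms=10):
    """Log energy per 25 ms frame (10 ms hop) -- a toy stand-in for log-mel features."""
    frame = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    feats = []
    for start in range(0, len(samples) - frame + 1, hop):
        energy = sum(s * s for s in samples[start:start + frame])
        feats.append(math.log(energy + 1e-10))  # floor avoids log(0) on silence
    return feats
```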

Data you will use (after running prepare.py)
- train_audio/  — training WAV files (anonymized filenames)
- test_audio/   — test WAV files (anonymized filenames)
- train_metadata.csv — columns: file_id, filepath, label
- test_data.csv — columns: file_id, filepath
- sample_submission.csv — columns: file_id, label (valid example labels)
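Reading the metadata and grouping clips by speaker is a common first step. A minimal sketch using only the stdlib (`load_train_metadata` is an illustrative name, not part of prepare.py):

```python
import csv
from collections import defaultdict

def load_train_metadata(path):
    """Read train_metadata.csv and group clip filepaths by speaker label."""
    by_speaker = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            by_speaker[row["label"]].append(row["filepath"])
    return by_speaker
```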

Conventions
- file_id uniquely identifies each clip (without the .wav extension). Use it to align predictions with test_data.csv.
- filepath is a relative path to the audio file under train_audio/ or test_audio/.
- label is an anonymized speaker ID (e.g., S000, S001, …). All labels that appear in the test set also appear in the training set.

Evaluation
- Metric: Macro F1 score over the speaker classes found in the training set.
  - For each class, compute F1 = 2 * (precision * recall) / (precision + recall), defining F1 = 0 when precision + recall = 0; the final score is the unweighted mean across classes.
  - Macro F1 treats each class equally and is robust to mild imbalance.
- Submission format: CSV with header and exactly two columns
  file_id,label
  Your file must contain every file_id from test_data.csv exactly once, and each label must be one of the labels present in train_metadata.csv.
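The metric can be implemented directly from the definition above. A dependency-free sketch (the official scorer may differ in minor details):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1; a class with precision + recall == 0
    contributes an F1 of 0."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```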

Train/test split (how prepare.py constructs the data)
- For each speaker directory, clips are sorted by their natural/chronological order (inferred from filenames). The first ~80% of clips go to the training set and the last ~20% to the test set (rounded so that every multi‑clip speaker has both train and test examples; single‑clip speakers go to test).
- Filenames are anonymized (e.g., clip_000123.wav) and labels are mapped to S000, S001, … to prevent label leakage.
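The split described above can be sketched as follows. This mirrors the description, but prepare.py's exact rounding rules may differ, and `split_clips` is an illustrative name:

```python
def split_clips(clips_by_speaker, train_frac=0.8):
    """Chronological per-speaker split: the first ~80% of each speaker's sorted
    clips go to train, the rest to test. Multi-clip speakers keep at least one
    clip on each side; single-clip speakers go entirely to test."""
    train, test = [], []
    for speaker, clips in clips_by_speaker.items():
        clips = sorted(clips)
        if len(clips) < 2:
            test.extend(clips)
            continue
        # Clamp so every multi-clip speaker has >= 1 train and >= 1 test clip.
        n_train = min(len(clips) - 1, max(1, round(len(clips) * train_frac)))
        train.extend(clips[:n_train])
        test.extend(clips[n_train:])
    return train, test
```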

Submission example
file_id,label
clip_000001,S013
clip_000002,S042
...
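Writing and sanity-checking a submission might look like this (the helper names are illustrative; the column names match the format above):

```python
import csv

def write_submission(predictions, out_path):
    """predictions: dict mapping file_id -> predicted speaker label."""
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["file_id", "label"])
        for file_id in sorted(predictions):
            w.writerow([file_id, predictions[file_id]])

def check_submission(sub_rows, test_ids, train_labels):
    """Every test file_id appears exactly once; every label was seen in training."""
    ids = [r["file_id"] for r in sub_rows]
    assert sorted(ids) == sorted(test_ids), "file_id mismatch or duplicates"
    assert all(r["label"] in train_labels for r in sub_rows), "unknown label"
```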

Have fun building robust speaker recognition systems that generalize across content and acoustic conditions!