Competition: Robust Speaker-Independent Keyword Recognition

Problem statement
You are given short audio clips of spoken words from a diverse set of speakers. The task is to build a model that recognizes the spoken command in each clip and predicts a label for every test clip. The split is speaker-disjoint: no test speaker appears in training, so the evaluation measures generalization to unseen speakers.

Why it’s challenging
- Many classes (≈35 command categories) with class imbalance.
- Variations in speakers, accents, and recording conditions.
- Short utterances where timing and transient information matter.
- Robustness to background noise and silence segments.

Files provided
- train.csv: two columns [id, label]. Each id corresponds to a WAV file in train_audio/.
- train_audio/: audio clips for training. File names are anonymized and do not reveal labels.
- test_data.csv: one column [id]. Each id corresponds to a WAV file in test_audio/.
- test_audio/: audio clips for testing. File names are anonymized and do not reveal labels.
- sample_submission.csv: example of the required submission format with valid placeholder labels.
- labels.json: list of class labels for convenience (participants can also infer labels from train.csv).
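As a starting point, the training index and label set can be loaded and cross-checked with the standard library alone. This is a minimal sketch; the helper name `load_index` is my own, and it assumes the file layout described above (train.csv with columns [id, label], labels.json holding a JSON list of class names).

```python
import csv
import json

def load_index(train_csv_path, labels_json_path):
    """Read the training index and the label set, and check they agree.

    Assumes train.csv has columns [id, label] and labels.json holds a
    JSON list of class names, as described in "Files provided".
    """
    with open(train_csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(labels_json_path) as f:
        labels = set(json.load(f))
    # Every label used in train.csv must appear in labels.json.
    assert {r["label"] for r in rows} <= labels, "unknown label in train.csv"
    return rows, sorted(labels)
```

Sorting the label list gives a stable class-index mapping across runs, which is useful when encoding targets for training.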

Audio format
- WAV files (mono, 16-bit PCM). Clip durations are roughly one second, but you should handle potential small variations robustly.
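One simple way to handle the small duration variations is to pad or crop every waveform to a fixed length before feature extraction. A minimal sketch, assuming a 16 kHz sample rate (the actual rate is not stated above, so `TARGET_LEN` should be set from the real data):

```python
import numpy as np

TARGET_LEN = 16000  # assumed: 1 s at 16 kHz; verify against the actual WAV headers

def fix_length(wave: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Zero-pad (symmetrically) or center-crop a 1-D waveform to target_len samples."""
    n = len(wave)
    if n < target_len:
        pad = target_len - n
        return np.pad(wave, (pad // 2, pad - pad // 2))
    if n > target_len:
        start = (n - target_len) // 2
        return wave[start:start + target_len]
    return wave
```

Center-padding keeps the utterance roughly centered in the window; left- or right-padding are equally valid choices as long as they are applied consistently to train and test clips.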

Evaluation metric
- Macro-averaged F1 score across all classes.
  - For each class c: F1_c = 2 * Precision_c * Recall_c / (Precision_c + Recall_c), with F1_c = 0 when Precision_c + Recall_c = 0.
  - The final score is the arithmetic mean of F1_c over the full label set.
- Macro F1 weights every class equally, so performance on rare classes counts as much as on frequent ones and the score is not dominated by the majority classes under class imbalance.
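The metric above can be computed from scratch in a few lines using the identity F1_c = 2·TP_c / (2·TP_c + FP_c + FN_c), which is algebraically equivalent to the precision/recall form. A minimal sketch (the function name `macro_f1` is my own):

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1 over the full label set.

    A class with no true positives and no false positives/negatives
    contributes F1_c = 0, matching the rule above.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    total = 0.0
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        total += (2 * tp[c] / denom) if denom else 0.0
    return total / len(labels)
```

This should match `sklearn.metrics.f1_score(..., average="macro")` when `labels` covers the full label set; note the average runs over all classes in `labels`, not just those present in the predictions.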

Submission format
- A single CSV file with two columns: id,label
- The id column must exactly match the ids listed in test_data.csv, with one prediction per id and no extra or missing rows.
- label must be one of the allowed class names appearing in train.csv (the same list is in labels.json).
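It is worth checking these constraints locally before submitting. A minimal sketch (the helper name `validate_submission` is my own), assuming test_data.csv has a header row with a single `id` column as described above:

```python
import csv

def validate_submission(sub_path, test_ids_path, allowed_labels):
    """Check a submission CSV: correct header, exact id coverage, valid labels."""
    with open(test_ids_path, newline="") as f:
        test_ids = [row["id"] for row in csv.DictReader(f)]
    with open(sub_path, newline="") as f:
        reader = csv.DictReader(f)
        assert reader.fieldnames == ["id", "label"], "header must be id,label"
        rows = list(reader)
    # One row per test id: no duplicates, no extras, no missing rows.
    assert len(rows) == len(test_ids), "row count must equal number of test ids"
    assert {r["id"] for r in rows} == set(test_ids), "ids must match test_data.csv"
    assert {r["label"] for r in rows} <= set(allowed_labels), "invalid label found"
```

Running this on sample_submission.csv first is a quick sanity check that the validator itself agrees with the provided format.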

Data split and leakage control
- The dataset is split by speaker: no speaker appears in both train and test.
- Filenames are anonymized to avoid revealing labels.
- Every class appears at least once in both the training and test sets so that all classes are evaluable.

Deliverables checklist
- Train your model on train.csv + train_audio/.
- Produce a CSV with predictions for all ids in test_data.csv.
- Ensure labels are valid and cover all test ids.

Notes
- Standard audio libraries and pre-trained audio front-ends may be used freely, provided your submission follows the required format.
