Title: Spoken Numerals with Intonation – Multitask Audio Classification

Overview
You are given short English speech clips of spoken numerals produced under different intonations. Your goal is to build a model that, for each audio clip, predicts both:
- Numeral: the spoken number (e.g., 0–100, 1,000, 1,000,000, 1,000,000,000)
- Intonation: one of {neutral, bored, excited, question}

The challenge emphasizes generalization to unseen speakers. The train/test split is made across disjoint speaker groups, so models must learn features that transfer across voices, accents, and recording conditions.

Public Directory Structure
All files you will use are under the public/ directory:
- public/train.csv: training metadata with labels
  - id: anonymized clip identifier
  - audio_path: relative path to the corresponding WAV file under public/train_audio/
  - Numeral: integer label for the spoken number
  - Intonation: categorical label in {neutral, bored, excited, question}
  - Speaker: integer speaker id (for analysis; not present in test.csv)
- public/test.csv: test metadata (no labels)
  - id
  - audio_path: relative path under public/test_audio/
- public/train_audio/: audio files for the training set (WAV, mono)
- public/test_audio/: audio files for the test set (WAV, mono)
- public/sample_submission.csv: a valid sample submission with random but plausible labels

Task
For each row in public/test.csv, use the corresponding audio clip in public/test_audio/ to predict Numeral and Intonation.

Evaluation Metric
Submissions are evaluated with a weighted accuracy score that rewards predicting both tasks correctly:
- Numeral accuracy (weight 0.7)
- Intonation accuracy (weight 0.3)
- Joint bonus for samples where both Numeral and Intonation are correct (weight 0.2)

Final score = 0.7 × NumeralAccuracy + 0.3 × IntonationAccuracy + 0.2 × JointAccuracy, clipped to [0, 1]. (The weights sum to 1.2, so clipping caps a perfect submission at 1.0.)

Where:
- NumeralAccuracy is the fraction of test samples with exactly correct Numeral.
- IntonationAccuracy is the fraction of test samples with exactly correct Intonation.
- JointAccuracy is the fraction of test samples where both predictions are simultaneously correct.
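For local validation, the metric above can be reimplemented in a few lines. This is an unofficial sketch of the formula as stated (not the competition's actual scorer); it assumes predictions and ground truth are given as aligned lists of (numeral, intonation) pairs:

```python
def score(preds, truth):
    """Weighted accuracy: 0.7*Numeral + 0.3*Intonation + 0.2*Joint,
    clipped to [0, 1]. Intonation is compared case-insensitively."""
    n = len(truth)
    num_acc = sum(p[0] == t[0] for p, t in zip(preds, truth)) / n
    int_acc = sum(p[1].lower() == t[1].lower() for p, t in zip(preds, truth)) / n
    joint = sum(p[0] == t[0] and p[1].lower() == t[1].lower()
                for p, t in zip(preds, truth)) / n
    return min(1.0, max(0.0, 0.7 * num_acc + 0.3 * int_acc + 0.2 * joint))
```

For example, getting half the numerals and all the intonations right yields 0.7 × 0.5 + 0.3 × 1.0 + 0.2 × 0.5 = 0.75.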

Submission Format
Create a CSV file with exactly these columns and order:
- id, Numeral, Intonation

Rules:
- id must match exactly the ids in public/test.csv (one row per id, no duplicates).
- Numeral must be an integer (e.g., 0, 1, 2, …, 1000, 1000000, 1000000000).
- Intonation must be one of: neutral, bored, excited, question (case-insensitive; labels are lower-cased before evaluation).

Design Notes
- Speaker-disjoint split: all speakers in public/test.csv are unseen in public/train.csv to enforce generalization across voices and accents.
- Label space coverage: all four intonation classes are represented in both train and test; the test set includes broad coverage of numerals.
- Filenames are anonymized and do not contain label information to prevent leakage.
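Because the test split is speaker-disjoint, local validation should be speaker-disjoint too, or scores will be optimistic. A minimal sketch using only the stdlib, assuming records are dicts carrying the Speaker column from public/train.csv:

```python
import random
from collections import defaultdict

def speaker_disjoint_split(records, val_fraction=0.2, seed=0):
    """Split records so that no speaker appears in both train and
    validation, mirroring the competition's held-out speaker groups."""
    by_speaker = defaultdict(list)
    for r in records:
        by_speaker[r["Speaker"]].append(r)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_val = max(1, int(len(speakers) * val_fraction))
    val_speakers = set(speakers[:n_val])
    train = [r for s in speakers[n_val:] for r in by_speaker[s]]
    val = [r for s in val_speakers for r in by_speaker[s]]
    return train, val
```

The same idea is available as GroupShuffleSplit/GroupKFold in scikit-learn, with Speaker as the group key.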

Good luck and have fun building inclusive, speaker-robust speech models!
