LJ Speech ASR Challenge: Robust Transcription from Single-Speaker Audio

Problem statement
Build an automatic speech recognition (ASR) system that transcribes short English audio clips into text. You are given thousands of single-speaker recordings and their normalized transcriptions for training. Your task is to predict accurate transcripts for a held-out test set of clips. This competition emphasizes end-to-end modeling and careful data processing for robust transcription.

Why it’s challenging
- The dataset contains 13k clips from a single speaker with diverse content and sentence lengths.
- Audio spans 1–10 seconds; variation in speaking rate and prosody requires robust modeling.

Data description
You will receive the following after running prepare.py:
- public/train.csv: columns [clip_id, audio_path, transcript]
- public/test.csv: columns [clip_id, audio_path]
- public/train_wavs/: audio files for training (WAV; mono, 22.05 kHz)
- public/test_wavs/: audio files for testing (WAV; mono, 22.05 kHz)
- public/sample_submission.csv: example submission with randomized but valid transcripts
- private/test_answer.csv: hidden ground-truth transcripts for evaluation

Notes
- clip_id is an anonymized identifier (e.g., clip_000123). File names do not reveal content.
- audio_path is an absolute path to the clip file on disk.
- transcript strings are normalized (lowercased, with punctuation and special symbols reduced). You may normalize further in your own pipeline, but the evaluation applies its own fixed normalization before scoring, so extra normalization on your side cannot hurt or help beyond text quality.
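As a starting point, the clips can be read with nothing but the standard library. The sketch below assumes the files are 16-bit PCM mono WAVs (as described above) on a little-endian machine; the function name and path are illustrative, not part of the provided code.

```python
import array
import wave

def load_clip(path):
    """Load a 16-bit PCM mono WAV as (samples, sample_rate).

    Samples are Python floats scaled to [-1.0, 1.0].
    """
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1, "expected mono audio"
        assert wf.getsampwidth() == 2, "expected 16-bit PCM"
        sr = wf.getframerate()
        pcm = wf.readframes(wf.getnframes())
    # 'h' = signed 16-bit; WAV data is little-endian, which matches
    # native byte order on most machines this will run on.
    samples = array.array("h", pcm)
    return [s / 32768.0 for s in samples], sr
```

In a real pipeline you would likely swap this for `soundfile` or `torchaudio`, which also handle resampling and batching.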

Submission format
- File name: submission.csv
- CSV with header and columns: [clip_id, transcript]
- Include every clip_id from public/test.csv exactly once. Order does not matter.
- transcript may be any UTF-8 string (empty strings are allowed), but note that an empty prediction scores every reference word as a deletion under WER, so transcript quality directly drives your score.
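The submission rules above can be satisfied with a few lines of standard-library code. This is a minimal sketch, assuming your predictions live in a `dict` keyed by `clip_id`; the function and variable names are hypothetical.

```python
import csv

def write_submission(predictions, out_path="submission.csv"):
    """Write a {clip_id: transcript} dict to a valid submission CSV.

    Emits the required header and one row per clip_id; empty
    transcripts are written as empty fields, which is allowed.
    """
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["clip_id", "transcript"])
        for clip_id, transcript in predictions.items():
            writer.writerow([clip_id, transcript])
```

Remember to include every `clip_id` from `public/test.csv` exactly once; a quick set comparison against the test CSV before writing is cheap insurance.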

Evaluation
- Metric: Word Error Rate (WER) averaged across clips after standard normalization (lower is better).
- Normalization: lowercasing, ASCII folding, removal of punctuation except internal apostrophes, whitespace collapsing.
- See metric.py for the exact implementation used in local validation and scoring.
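metric.py is the authoritative implementation; the sketch below only approximates the normalization rules listed above (lowercasing, ASCII folding, punctuation stripping with internal apostrophes kept, whitespace collapsing) plus a standard word-level Levenshtein WER, so you can sanity-check predictions locally.

```python
import re
import unicodedata

def normalize(text):
    """Approximate the competition's text normalization."""
    # Lowercase, then decompose accented characters so the accents
    # can be dropped (ASCII folding); other non-ASCII becomes a space.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(
        "" if unicodedata.category(c) == "Mn"
        else (c if ord(c) < 128 else " ")
        for c in text
    )
    # Replace punctuation with spaces, keeping apostrophes for now.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Keep only apostrophes flanked by letters (e.g. "don't").
    text = re.sub(r"(?<![a-z])'|'(?![a-z])", " ", text)
    return " ".join(text.split())

def wer(ref, hyp):
    """Word error rate via Levenshtein distance over normalized words."""
    r, h = normalize(ref).split(), normalize(hyp).split()
    d = list(range(len(h) + 1))  # one rolling DP row
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)
```

Treat any disagreement between this sketch and metric.py as a bug in the sketch.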

Train/test split
- The dataset is split into train/test with an 80/20 proportion at the utterance level, grouped by identical normalized transcripts to avoid exact-duplicate leakage.
- File names are anonymized to prevent label leakage.
- All audio files corresponding to the split are provided under train_wavs/ and test_wavs/ and aligned with the CSVs.
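The grouped split described above can be sketched as follows. This is an illustrative reconstruction, not the actual prepare.py logic: it buckets clips by normalized transcript and assigns whole buckets to train or test, so exact duplicates never straddle the split.

```python
import random
from collections import defaultdict

def grouped_split(rows, test_frac=0.2, seed=0):
    """Split (clip_id, transcript) pairs ~80/20 at the utterance level.

    Clips sharing an identical transcript are grouped and assigned to
    the same side, preventing exact-duplicate leakage.
    """
    groups = defaultdict(list)
    for clip_id, transcript in rows:
        groups[transcript].append(clip_id)
    keys = sorted(groups)  # sort for determinism before shuffling
    random.Random(seed).shuffle(keys)
    n_test = round(len(keys) * test_frac)
    test_keys = set(keys[:n_test])
    train = [c for k in keys if k not in test_keys for c in groups[k]]
    test = [c for k in test_keys for c in groups[k]]
    return train, test
```

Because whole groups move together, the realized clip-level proportion can drift slightly from 80/20 when duplicate groups are large.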


Files provided in public/
- train.csv
- test.csv
- sample_submission.csv
- train_wavs/
- test_wavs/

Hidden in private/
- test_answer.csv

Good luck!