Competition: Stuttering Event Probability Estimation (SEP-28k)

Problem statement
Participants are challenged to detect and quantify stuttering-related acoustic events in 3‑second speech clips. For each clip, predict the probability (0–1) of multiple labels that may co-occur. Labels reflect aggregated judgments from three annotators; target values are soft labels in [0,1] representing the fraction of annotators who marked each event.

Labels to predict (multi-label; not mutually exclusive)
- Prolongation
- Block
- SoundRep (sound repetition)
- WordRep (word repetition)
- Interjection (e.g., “um”, “uh”)
- NoStutteredWords (annotator judged the clip free of stuttering events)
- PoorAudioQuality
- DifficultToUnderstand
- NaturalPause
- Music
- NoSpeech

What’s provided
- audio/train: WAV files for training, named {id}.wav
- audio/test: WAV files for testing, named {id}.wav
- train.csv: columns = id plus the 11 label columns above. Values are probabilities (counts/3) derived from three annotators.
- test.csv: columns = id (no labels).
- sample_submission.csv: correct submission format with random but valid values.
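The layout above can be turned into a small helper that maps a CSV id to its WAV file. This is a sketch, not official starter code; the `audio_path` name and the `root` parameter are assumptions, while the `audio/{split}/{id}.wav` layout follows the description above.

```python
from pathlib import Path


def audio_path(split: str, clip_id: str, root: Path = Path(".")) -> Path:
    """Return the WAV path for a clip id, following the {id}.wav naming above.

    split: "train" or "test"; root: dataset root directory (assumed layout).
    """
    return root / "audio" / split / f"{clip_id}.wav"
```

For example, `audio_path("train", "12345")` yields `audio/train/12345.wav` relative to the dataset root.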

Notes on data
- Each recording is a 3‑second mono WAV clip. Sample rates and amplitudes may vary across clips; robust audio loading is recommended.
- Labels are soft because annotators may disagree; this also means labels like NoStutteredWords can co-occur with stutter indicators due to partial agreement. Treat the task as multi-label probability prediction rather than single-label classification.
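Because sample rates and amplitudes vary, a common preprocessing step is to standardize each loaded clip to mono, a fixed sample rate, a fixed length, and a normalized amplitude. The sketch below assumes the audio has already been read into a NumPy array (e.g. with `soundfile` or `librosa`); the target rate of 16 kHz and the naive linear-interpolation resampler are illustrative choices, not part of the competition specification.

```python
import numpy as np


def standardize_clip(audio: np.ndarray, sr: int, target_sr: int = 16000,
                     target_len_s: float = 3.0) -> np.ndarray:
    """Make a clip mono, resampled, peak-normalized, and exactly 3 s long."""
    # Mix multi-channel audio down to mono.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    audio = audio.astype(np.float32)
    # Naive linear-interpolation resample; prefer librosa/torchaudio in practice.
    if sr != target_sr:
        n_out = int(round(len(audio) * target_sr / sr))
        audio = np.interp(np.linspace(0, len(audio) - 1, n_out),
                          np.arange(len(audio)), audio).astype(np.float32)
    # Peak-normalize so amplitude differences across clips matter less.
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio / peak
    # Pad with silence or trim so every clip has the same number of samples.
    n = int(target_sr * target_len_s)
    if len(audio) < n:
        audio = np.pad(audio, (0, n - len(audio)))
    return audio[:n]
```

Fixing the length up front keeps batching simple for downstream feature extraction or neural models.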

Submission format
Submit a CSV with columns in this exact order:
- id, Prolongation, Block, SoundRep, WordRep, Interjection, NoStutteredWords, PoorAudioQuality, DifficultToUnderstand, NaturalPause, Music, NoSpeech

Requirements
- The id set must match test.csv exactly (one row per id; no extra/missing/duplicate ids).
- All prediction values must be numeric, finite, and in [0,1].

Evaluation metric (lower is better)
Submissions are evaluated with average soft binary cross-entropy across all labels and clips:
- For each label ℓ and clip i with target y_iℓ ∈ [0,1] and prediction p_iℓ ∈ (0,1), the loss is:  −[ y_iℓ log(p_iℓ) + (1−y_iℓ) log(1−p_iℓ) ].
- The final score is the mean loss over all pairs (i,ℓ). Lower is better.
This metric respects the soft nature of the targets and penalizes both over- and under-confidence.
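The metric can be reproduced locally with a few lines of NumPy. Since the loss is defined for p ∈ (0,1), this sketch clips predictions slightly inside the interval so the logarithms stay finite; the `eps` value is an illustrative choice, not an official constant.

```python
import numpy as np


def soft_bce(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-7) -> float:
    """Mean soft binary cross-entropy over all (clip, label) pairs."""
    p = np.clip(y_pred, eps, 1.0 - eps)  # keep log() finite at the endpoints
    loss = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(loss.mean())
```

Note that with soft targets the minimum achievable loss is not zero: predicting p = y exactly still incurs the entropy of y wherever y is strictly between 0 and 1.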


Files in the dataset
- audio/train/{id}.wav
- audio/test/{id}.wav
- train.csv
- test.csv
- sample_submission.csv

Reproducibility
The official split is fixed and provided by the organizers. Do not attempt to infer test labels from filenames; filenames were anonymized to prevent leakage.
