Abstract: Speech disfluency research is pivotal to accommodating atypical speakers in mainstream conversational technology. However, the lack of publicly available labeled and unlabeled datasets is a significant bottleneck to such research. While many works use pseudo-disfluency data with proxy labels to formulate a self-supervised task, we see merit in using real-world data. In this work, we consolidate publicly available speech disfluency corpora, both with and without labels, and propose DisfluentSiam, an efficient Siamese-network-based, small-scale pretraining pipeline that uses task-specific data from multiple domains and has only 10M trainable parameters. We show that DisfluentSiam achieves an average 15% boost in performance across five types of disfluency event detection compared to direct wav2vec 2.0 embeddings. In particular, with only 4-5 minutes of labeled data for fine-tuning, DisfluentSiam demonstrates the advantage of task-specific pretraining with up to 25% higher accuracy.
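The abstract does not spell out the pretraining objective, so the sketch below only illustrates the general kind of Siamese setup it describes: a small trainable head over frozen, pooled wav2vec 2.0 features trained with a SimSiam-style stop-gradient objective. The class name, layer sizes, and the two-view input are illustrative assumptions, not the authors' released architecture.

```python
# Minimal sketch (assumptions, not the paper's code) of a Siamese pretraining
# head over frozen wav2vec 2.0 features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseHead(nn.Module):
    def __init__(self, feat_dim=768, proj_dim=512, pred_dim=128):
        super().__init__()
        # Projector and predictor are small MLPs; the wav2vec 2.0 backbone is
        # assumed frozen, so only this lightweight head is trained.
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim), nn.BatchNorm1d(proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, pred_dim), nn.ReLU(),
            nn.Linear(pred_dim, proj_dim),
        )

    def forward(self, x1, x2):
        # x1, x2: two augmented views of the same utterance, mean-pooled
        # wav2vec 2.0 features of shape (batch, feat_dim).
        z1, z2 = self.projector(x1), self.projector(x2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Symmetric negative cosine similarity with stop-gradient on targets.
        loss = -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                 + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2
        return loss

if __name__ == "__main__":
    # Toy usage with random tensors standing in for real pooled features.
    model = SiameseHead()
    x1, x2 = torch.randn(8, 768), torch.randn(8, 768)
    print(model(x1, x2).item())
```

After such pretraining, the head's representations would be fine-tuned with the small amount of labeled disfluency data mentioned in the abstract; the exact augmentation, pooling, and fine-tuning recipe are not specified here and are left as assumptions.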