Abstract: Oral reading fluency assessment is a process in which a student reads a passage aloud and a human listener scores the words read correctly. Current automatic reading fluency systems match these words using speech recognition models trained on clean speech from native adult speakers. This mismatch between training and deployment conditions, compounded by the many background noises of a classroom, means that student speech is often misrecognized. This paper describes a deep learning model that employs text-to-speech and contrastive learning to create acoustic word embeddings of student speech. The embedding model is trained on unlabeled recordings of students reading known passages. Our model then uses sub-sequence matching in the acoustic embedding space to estimate words read correctly per minute, a common criterion in oral reading fluency. Our model’s estimates are significantly closer to those of human listeners than those of systems that rely on automatic speech recognition alone, reducing the average error in words correct per minute from 15.1 to 8.4.
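To make the matching step concrete, the following is a minimal sketch of sub-sequence matching in an embedding space: given one acoustic embedding per spoken segment and one reference embedding per passage word, a greedy monotonic aligner counts passage words whose cosine similarity to some later student segment exceeds a threshold, and the count is converted to words correct per minute. All function names, the greedy alignment strategy, and the threshold are illustrative assumptions, not the authors' actual procedure.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def count_words_matched(student_emb, passage_emb, threshold=0.7):
    """Greedy monotonic sub-sequence matching (illustrative only):
    walk through the passage words in order and accept each word whose
    best remaining student segment exceeds the similarity threshold."""
    sim = cosine_sim(student_emb, passage_emb)  # (n_segments, n_words)
    matched = 0
    i = 0  # earliest student segment still available for alignment
    for j in range(passage_emb.shape[0]):
        if i >= student_emb.shape[0]:
            break  # ran out of student speech; remaining words unread
        k = i + int(np.argmax(sim[i:, j]))  # best segment at/after i
        if sim[k, j] >= threshold:
            matched += 1
            i = k + 1  # enforce left-to-right (monotonic) alignment
    return matched

def wcpm(matched_words, duration_seconds):
    # Words read correctly per minute.
    return matched_words * 60.0 / duration_seconds
```

For example, with a four-word passage and a student who reads the first, second, and fourth words recognizably, the aligner skips the unmatched third word and counts three correct words; a 30-second reading then yields a WCPM of 6.0.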
External IDs: dblp:conf/icassp/WangWNKNL24