A comparative analysis of automatic speech recognition errors in small group classroom discourse.

Jie Cao, Ananya Ganesh, Jon Cai, Rosy Southwell, E. Margaret Perkoff, Michael Regan, Katharina Kann, James H. Martin, Martha Palmer, Sidney D'Mello

07 Jun 2023, OpenReview Archive Direct Upload. Readers: Everyone

Abstract: In collaborative learning environments, effective intelligent learning systems need to accurately analyze and understand the collaborative discourse between learners (i.e., group modeling) to provide adaptive support. We investigate how automatic speech recognition (ASR) errors influence discourse models of small group collaboration in noisy real-world classrooms. Our dataset consisted of 30 students recorded by consumer off-the-shelf microphones (Yeti Blue) while engaging in dyadic and triadic collaborative learning in a multi-day STEM curriculum unit. We found that two state-of-the-art ASR systems (Google Speech and OpenAI Whisper) yielded very high word error rates (0.822 and 0.847, respectively) but very different error profiles: Google was more conservative, rejecting 38% of utterances versus 12% for Whisper. Next, we examined how these ASR errors influenced downstream small group modeling based on pre-trained large language models for three tasks: Abstract Meaning Representation parsing (AMR), on-task/off-task detection (OnTask), and Accountable Productive Talk prediction (APT). As expected, models trained on clean human transcripts showed degraded performance on ASR transcripts for all three tasks, as measured by the transfer ratio (TR). However, the TR of the specific sentence-level AMR task (.39-.62) was much lower than that of the abstract discourse-level OnTask (.63-.94) and APT (.64-.72) tasks. Furthermore, different training strategies that incorporated ASR transcripts, either alone or as augmentations of human transcripts, increased accuracy for the discourse-level tasks (OnTask and APT) but not for AMR. Simulation experiments suggested that the models were tolerant of missing utterances in the dialog context, and that jointly improving ASR accuracy on important word classes (e.g., verbs and nouns) can improve performance across all tasks. Overall, our results provide insights into how different types of NLP-based tasks might tolerate ASR errors under extremely noisy conditions, and they offer suggestions for improving accuracy in small group modeling settings to support a more equitable, engaging, and adaptive collaborative learning environment.
