Abstract: A number of studies have compared human and machine transcription, showing that automatic speech recognition (ASR) is approaching human performance in some contexts. Most studies examine differences as measured by the standard speech recognition scoring criterion: word error rate (WER). This study presents a finer-grained analysis of differences on conversational speech data where systems have reached human parity in terms of average WER, specifically insertions vs. deletions, word category, and word context as characterized by linguistic surprisal. In contrast to ASR systems, humans are more likely to miss words than to misrecognize them, and they are much more likely to make errors in transcribing words associated primarily with conversational contexts (fillers, backchannels, and discourse cue words). The differences are more pronounced in more informal contexts, i.e., conversations between family members. Although human transcribers may miss these words, conversational partners seem to use them in turn-taking and in processing disfluencies. Thus, ASR systems may need superhuman transcription performance for spoken language technology to achieve human-level conversation skills.
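The insertion/deletion distinction discussed above comes directly out of the standard WER computation: a minimum-edit-distance alignment of the hypothesis against the reference, with errors split into substitutions, deletions, and insertions. The sketch below (not from the paper; function and variable names are illustrative) shows how that breakdown can be computed with a simple dynamic-programming alignment:

```python
def wer_breakdown(reference, hypothesis):
    """Return (wer, substitutions, deletions, insertions) for two word lists.

    WER = (S + D + I) / len(reference), from a minimum-edit-distance alignment.
    """
    ref, hyp = reference, hypothesis
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)   # only deletions of reference words
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)   # only insertions of hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]     # match, no edit
            else:
                t, s, d, k = cost[i - 1][j - 1]
                sub = (t + 1, s + 1, d, k)          # substitution
                t, s, d, k = cost[i - 1][j]
                dele = (t + 1, s, d + 1, k)         # deletion (word missed)
                t, s, d, k = cost[i][j - 1]
                ins = (t + 1, s, d, k + 1)          # insertion (extra word)
                cost[i][j] = min(sub, dele, ins)    # pick cheapest path
    total, subs, dels, ins = cost[n][m]
    return total / max(n, 1), subs, dels, ins

# A transcriber who drops the filler "uh" incurs one deletion:
print(wer_breakdown("uh yes i think so".split(), "yes i think so".split()))
# -> (0.2, 0, 1, 0)
```

Under this accounting, a human transcriber who skips fillers or backchannels accumulates deletions, whereas an ASR system that mishears such words accumulates substitutions, which is the asymmetry the abstract describes.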