Abstract: Pronunciation assessment remains a subjective task that depends on a pronunciation reference held as canonical. Whether a second language (L2) speaker is able to replicate that reference is decided by an assessor who perceives the identity of the sounds produced. Assessors are known to be biased by their perception of the speaker, hence the definition of a standard for L2 pronunciation is crucial in formal assessment. In Computer Assisted Pronunciation Assessment (CAPA), defining a pronunciation standard for L2 is not trivial because little L2 data is annotated for mispronunciations. Inspired by the assessor’s bias, this work explores an alternative to a conventional Automatic Speech Recognition approach for CAPA by using speaker metadata along with acoustic observations for mispronunciation detection. A Bidirectional Long Short-Term Memory (BiLSTM) network combined with self-attention was used to detect pronunciation errors in short speech segments. It was found that categorical metadata can have a positive effect on the classification of mispronounced segments, depending on the sparsity and balance of the classes. It was also found that different assessors can be influenced differently by information about the speaker’s linguistic background. The effect of the metadata was tested on data from Dutch child learners of English as L2 in schools across the Netherlands. The limited speaker diversity of the corpus made the task a challenge worth exploring further.