Abstract: Interpretability research has shown that self-supervised Spoken Language
Models (SLMs) encode a wide variety of features of human speech, ranging from
the acoustic, phonetic, phonological, syntactic and semantic levels to speaker
characteristics. The bulk of prior research on representations of phonology
has focused on segmental features such as phonemes; the encoding of
suprasegmental phonology (such as tone and stress patterns) in SLMs is not yet
well understood. Tone is a suprasegmental feature that is present in more than
half of the world's languages. This paper analyzes the tone-encoding
capabilities of SLMs, using Mandarin and Vietnamese as case studies. We show
that SLMs encode lexical tone to a significant degree even when they are
trained on data from non-tonal languages. We further find that SLMs behave
similarly to native and non-native human participants in tone and consonant
perception studies, but they do not follow the same developmental trajectory.
Paper Type: long
Research Area: Speech recognition, text-to-speech and spoken language understanding
Contribution Types: Model analysis & interpretability
Languages Studied: Mandarin Chinese, Vietnamese