Keywords: Spoken word recognition, speech perception, Bayesian modeling, talker familiarity
TL;DR: We are developing a new adaptive Bayesian model of human spoken word recognition
Abstract: Existing models of spoken-word recognition such as TRACE [1], the Distributed Cohort Model [2], and Shortlist B [3] each offer computationally explicit accounts of how speech is recognized and each can simulate an impressive range of phenomena. But these models fail to account for talker variability: They are effectively models of how words might be recognized if they were always spoken by the same talker. This deficiency is striking, given substantial evidence that word recognition is talker-contingent [4] and that listeners can tune in to the idiosyncrasies of individual talkers’ speech [5]. These models have a second striking deficiency. Listeners use prosodic information (e.g., lexical stress cues) in word recognition [6] and yet the models are effectively prosodically deaf (i.e., they treat speech as if it were made up of strings of vowels and consonants with no suprasegmental structure).
We present a new model of spoken-word recognition: the Adaptive Bayesian Continuous-speech (ABC) model. The ABC model addresses the above two deficiencies: It is the first of its kind to handle talker variability and the first that is not prosodically deaf. The ABC model takes inspiration from three other Bayesian ideal observer models: Shortlist B [3], the Bayesian Prosody Recognizer [6], and the ideal adapter framework [7]. Here, we present ABC’s architecture and show how the model handles talker variability in the realization of fricative consonants.
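As a sketch of this formulation (ours, extrapolating from Shortlist B’s Bayesian word recognition; the talker-conditioning is our reading of what ABC adds), recognition of a candidate word $w_i$ from acoustic input $x$ produced by talker $t$ can be written as

$$
P(w_i \mid x, t) \;=\; \frac{P(x \mid w_i, t)\, P(w_i)}{\sum_j P(x \mid w_j, t)\, P(w_j)}
$$

where a talker-insensitive model such as Shortlist B is the special case in which $P(x \mid w, t) = P(x \mid w)$ for all talkers.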
A critical assumption (built into the model’s name) is that speech recognition is adaptive: talker variability is dealt with through retuning of perception. ABC therefore has adaptive voice “plug-ins” for multiple talkers. The core idea is that ABC has knowledge of the distributional properties of the acoustic cues to different phonological categories (i.e., probability density functions), knowledge about how these likelihood functions vary across talkers, and the ability, through the deployment of plug-ins with this knowledge, to retune the recognition process as the input changes from one talker to another. The first step towards a full ABC model is the implementation of plug-ins for talker-specific cues to fricative consonants, using existing data on lexically guided perceptual learning in Dutch [5,8]. Different likelihood functions for different talkers will be constructed for acoustic cues to the fricative contrast (e.g., spectral centre of gravity for [f] and [s]), modelling the input in the exposure phase of [5]. The model will then be evaluated, as in the test phase of [8], on its ability to recognize words containing the fricatives and, critically, on whether this recognition is talker-contingent. The model will do talker recognition through bottom-up template matching. The talker’s plug-in will then be deployed: fricative likelihoods will be adjusted in a talker-specific way, leading to changes in the recognition of words containing those fricatives.
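To make the plug-in idea concrete, here is a minimal, purely illustrative sketch, assuming Gaussian likelihoods over spectral centre of gravity: a talker-specific plug-in amounts to a set of per-talker category distributions, and deploying it changes how an ambiguous token is categorized. The talker labels, the fricative_posterior helper, and all parameter values are hypothetical assumptions, not estimates from the data of [5,8].

```python
from scipy.stats import norm

# Hypothetical talker "plug-ins": per-talker Gaussian likelihoods over the
# spectral centre of gravity (CoG, Hz) of [f] and [s]. All parameter values
# are illustrative assumptions, not fitted to the Dutch exposure data.
PLUGINS = {
    "talker_A": {"f": norm(4000, 800), "s": norm(7000, 800)},  # canonical categories
    "talker_B": {"f": norm(5500, 800), "s": norm(7000, 800)},  # [f] retuned upwards,
                                                               # as after f-biased exposure
}

def fricative_posterior(cog_hz, talker, prior_f=0.5):
    """P(fricative | CoG, talker), computed under the deployed plug-in."""
    pdfs = PLUGINS[talker]
    post_f = pdfs["f"].pdf(cog_hz) * prior_f
    post_s = pdfs["s"].pdf(cog_hz) * (1 - prior_f)
    total = post_f + post_s
    return {"f": post_f / total, "s": post_s / total}

# The same acoustically ambiguous token (CoG = 5500 Hz) is categorized
# differently depending on whose plug-in is deployed: roughly 50/50 for
# talker_A, but predominantly [f] for talker_B.
for talker in PLUGINS:
    print(talker, fricative_posterior(5500.0, talker))
```

In the full model, these talker-adjusted fricative likelihoods would feed into the Bayesian word recognition computation above, which is what would make word recognition talker-contingent.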
This work is ongoing and simulation results are not yet available. We hope to show that the ABC model can simulate generalization of talker-specific learning about fricatives across a large (20,000-word) lexicon. In future work, we will (a) add prosodic structure to the model, (b) address prosodic talker variability, and (c) capture how listeners deal with groups of talkers who share the same regional or foreign accent. The fricative retuning simulations, however, will already offer an existence proof that talker-specific adaptability can be built into a large-lexicon Bayesian model.
Submission Number: 9