Unsupervised Acquisition of Phonemes, Words, and Grammar from Continuous Speech Signals

Published: 01 Jan 2024, Last Modified: 12 Jun 2025, ICDL 2024, CC BY-SA 4.0
Abstract: Humans can acquire language by segmenting continuous speech signals, which have a double articulation structure, into phonemes and words without explicit boundary points or labels, and by learning the transition rules of words as grammar. Learning the double articulation structure of speech signals is crucial for realizing robots with language learning abilities similar to those of humans. In this study, we propose a novel probabilistic generative model (PGM) that can learn phonemes, words, and grammar from continuous speech signals by hierarchically connecting a Gaussian process hidden semi-Markov model (GP-HSMM) and a hidden semi-Markov model (HSMM). In the proposed method, the parameters of the two PGMs are updated mutually, so that the learned grammatical structure informs the estimation of phonemes and words, thereby enabling more accurate learning of phonemes and words. The experimental results reveal that the proposed approach, which includes grammar learning, segments continuous speech into phonemes and words more accurately than conventional methods. Furthermore, we found that grammar learning significantly affected the accurate estimation of the number of words in a sentence.
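The abstract describes hierarchically connecting a GP-HSMM (for phoneme and word discovery) with an HSMM (for word-transition, i.e., grammar, learning) and mutually updating their parameters. The paper does not give implementation details here, so the following is only a minimal Python sketch of the segmental (semi-Markov) dynamic program that such unsupervised segmenters rely on; the function name, score arrays, and toy data are hypothetical stand-ins for the acoustic and duration models of the actual method.

```python
import numpy as np

def semi_markov_viterbi(log_emit, log_dur, max_len):
    """Best segmentation of a frame sequence under a semi-Markov scoring scheme.

    log_emit : (T, K) array, log-likelihood of frame t under latent unit k.
    log_dur  : (max_len, K) array, log-probability that unit k spans d frames.
    Returns a list of (start, end, unit) triples with end exclusive.
    """
    T, K = log_emit.shape
    # Prefix sums let us score any contiguous segment in O(1).
    cum = np.vstack([np.zeros((1, K)), np.cumsum(log_emit, axis=0)])
    best = np.full(T + 1, -np.inf)
    best[0] = 0.0
    back = [None] * (T + 1)
    for t in range(1, T + 1):
        for d in range(1, min(max_len, t) + 1):
            seg = cum[t] - cum[t - d] + log_dur[d - 1]   # score per unit, shape (K,)
            k = int(np.argmax(seg))
            score = best[t - d] + seg[k]
            if score > best[t]:
                best[t], back[t] = score, (t - d, k)
    # Trace the optimal segmentation back from the final frame.
    segments, t = [], T
    while t > 0:
        s, k = back[t]
        segments.append((s, t, k))
        t = s
    return segments[::-1]

# Toy usage: six frames whose scores favour unit 0 first and then unit 1.
log_emit = np.log(np.array([[0.9, 0.1]] * 3 + [[0.2, 0.8]] * 3))
log_dur = np.log(np.full((3, 2), 1.0 / 3.0))
print(semi_markov_viterbi(log_emit, log_dur, max_len=3))
```

In the proposed hierarchy, the unit scores in such a dynamic program would come from the GP-HSMM's acoustic likelihoods at the phoneme/word level and from the HSMM's word-transition (grammar) probabilities at the sentence level, which is what allows the mutual parameter updates described above.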