Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no access to speech data

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: Human speakers encode information into raw speech, which is then decoded by listeners. This complex relationship between encoding (production) and decoding (perception) is often modeled separately. Here, we test how decoding of lexical and sublexical semantic information can emerge automatically from raw speech in unsupervised generative deep convolutional networks that combine both the production and perception principles. We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: an unsupervised network that must learn to assign unique representations to lexical items with no direct access to training data. We train several models (ciwGAN and fiwGAN; Beguš 2021) and test how the networks classify raw acoustic lexical items in unobserved test data. Strong evidence of lexical learning emerges. The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data in an unsupervised manner without ever accessing real training data. We propose a technique to explore lexical and sublexical learned representations in the classifier network. The results bear implications for unsupervised speech synthesis and recognition, as well as for unsupervised semantic modeling, as language models increasingly bypass text and operate on raw acoustics.
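The encoding/decoding coupling described above follows the InfoGAN-style objective used in ciwGAN/fiwGAN: the generator receives a categorical latent code (one code per lexical item) plus noise, and a Q-network must recover that code from the generated waveform alone. The sketch below illustrates that code-recovery loss with numpy; the dimensions (5 classes, 95 noise variables) and the random stand-in for the Q-network's logits are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(batch, n_classes=5, noise_dim=95):
    """Sample ciwGAN-style latents: a one-hot categorical code
    (one hypothetical 'lexical item' per class) concatenated with
    uniform noise, as fed to the generator."""
    codes = rng.integers(0, n_classes, size=batch)
    one_hot = np.eye(n_classes)[codes]
    noise = rng.uniform(-1.0, 1.0, size=(batch, noise_dim))
    return codes, np.concatenate([one_hot, noise], axis=1)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def code_recovery_loss(q_logits, codes):
    """Cross-entropy between the code given to the generator and the
    Q-network's prediction from the generated audio -- the term that
    pushes the network toward unique per-item representations."""
    probs = softmax(q_logits)
    return -np.mean(np.log(probs[np.arange(len(codes)), codes] + 1e-12))

# Stand-in for Q(G(z)): random logits here; in the real model these come
# from a convolutional classifier applied to raw generated waveforms.
codes, z = sample_latent(8)
q_logits = rng.normal(size=(8, 5))
loss = code_recovery_loss(q_logits, codes)
```

Minimizing this loss jointly with the adversarial objective is what lets the classifier learn to decode lexical identity without ever observing real training audio directly.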