Audio Word2vec: Sequence-to-Sequence Autoencoding for Unsupervised Learning of Audio Segmentation and Representation

Yi-Chen Chen, Sung-Feng Huang, Hung-yi Lee, Yu-Hsuan Wang, Chia-Hao Shen

IEEE ACM Trans. Audio Speech Lang. Process., 2019 (modified: 11 Nov 2021)

Abstract: In text, word2vec transforms each word into a fixed-size vector used as the basic component in applications of natural language processing. Given a large collection of unannotated audio, audio word2vec can also be trained in an unsupervised way using a sequence-to-sequence autoencoder (SA). These vector representations are shown to effectively describe the sequential phonetic structures of the audio segments. In this paper, we further extend this research in the following two directions. First, we disentangle phonetic information and speaker information from the SA vector representations. Second, we extend audio word2vec from the word level to the utterance level by proposing a new segmental audio word2vec in which unsupervised spoken word boundary segmentation and audio word2vec are jointly learned and mutually enhanced, and utterances are directly represented as sequences of vectors carrying phonetic information. This is achieved by means of a segmental sequence-to-sequence autoencoder, in which a segmentation gate trained with reinforcement learning is inserted in the encoder.
