wav2tok: Deep Sequence Tokenizer for Audio Retrieval

Adhiraj Banerjee; Vipul Arora

Published: 01 Feb 2023, Last Modified: 17 Feb 2023

Venue: ICLR 2023 poster

Readers: Everyone

Keywords: sequence representation learning, audio search, music retrieval

TL;DR: Represent query and target sequences as compressed token sequences for quick retrieval; similarity semantics are learned from sequence pairs

Abstract: Search over audio sequences is a fundamental problem. In this paper, we propose a method to extract concise discrete representations for audio that can be used for efficient retrieval. Our motivation comes from orthography, which represents the speech of a given language in a concise and distinct discrete form. The proposed method, wav2tok, learns such representations for any kind of audio, speech or non-speech, from pairs of similar audio. wav2tok compresses the query and target sequences into shorter sequences of tokens that are faster to match. The learning method makes use of the CTC loss and the expectation-maximization algorithm, which are generally used for supervised automatic speech recognition and for learning discrete latent variables, respectively. Experiments show that wav2tok performs consistently across two audio retrieval tasks, music search (query by humming) and speech search via audio query, outperforming state-of-the-art baselines.
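To make the retrieval idea in the abstract concrete, here is a minimal sketch of matching compressed token sequences. It is not the authors' implementation. It assumes frame-level token ids are already available from some tokenizer, shortens them by merging consecutive repeats (in the spirit of CTC-style collapsing), and ranks targets by Levenshtein edit distance. The abstract does not say which matching metric is used, so edit distance is an assumption, and the names `collapse_repeats`, `edit_distance`, and `rank_targets` are hypothetical.

```python
from itertools import groupby


def collapse_repeats(frame_tokens):
    """Merge each run of identical consecutive tokens into one token."""
    return [tok for tok, _ in groupby(frame_tokens)]


def edit_distance(a, b):
    """Levenshtein distance between two token sequences (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_targets(query_frames, targets):
    """Return (distance, name) pairs sorted from best to worst match.

    `targets` maps a hypothetical target name to its frame-level tokens.
    """
    query = collapse_repeats(query_frames)
    scored = [(edit_distance(query, collapse_repeats(frames)), name)
              for name, frames in targets.items()]
    return sorted(scored)
```

Because matching runs on the collapsed sequences, its cost depends on the number of distinct token runs, not on the number of audio frames, which is the efficiency argument the abstract makes.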

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning

Supplementary Material: zip
