Representing speech through autoregressive prediction of cochlear tokens

Greta Tuckute; Klemen Kotar; Evelina Fedorenko; Dan Yamins

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: audio, speech, biology-inspired model, autoregressive prediction, cochlea

TL;DR: We propose a speech representation model, CochStream, which leverages simple autoregressive prediction on a time-frequency representation inspired by the human cochlea.

Abstract: We introduce a biologically inspired model for encoding speech through an autoregressive prediction objective applied to input representations modeled after the human cochlea. Our modeling framework is inspired by the human auditory processing hierarchy. The first stage of our framework transforms the raw audio waveform into a time-frequency representation inspired by the human cochlea, with an intermediate step that discretizes the audio representation into cochlear tokens. The second stage of our model learns a simple, yet powerful, autoregressive sequence model over the discretized audio input. We demonstrate that our model learns meaningful representations of phonemes and word identities, and state-of-the-art representations of lexical semantic similarity. In addition, our model shows competitive performance on several downstream audio tasks from the SUPERB benchmark. Beyond these representational capabilities, we demonstrate the model's ability to generate continuations of audio at various temporal scales, which can be visualized in a cochleagram time-frequency space to provide insights into the model's predictions. Our model provides a novel framework for speech representation learning, aiming to advance the development of more human-like models that flexibly and efficiently handle a range of speech-based tasks.
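
Illustrative sketch (not the authors' implementation): the two-stage pipeline described in the abstract can be mimicked in a few lines of NumPy. Everything below is an assumption made for illustration only. The front end is a plain STFT pooled into log-spaced bands rather than a real cochlear model, the tokenizer is uniform quantization rather than whatever discretization CochStream uses, and the "autoregressive sequence model" is a smoothed bigram table standing in for a learned network. The function names (toy_cochleagram, tokenize, fit_bigram, next_token_nll, generate) and all parameter values are hypothetical.

```python
import numpy as np

def toy_cochleagram(wave, sr=16000, n_fft=512, hop=160, n_bands=32):
    """Stage 1 stand-in: STFT power pooled into log-spaced frequency bands."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = np.geomspace(50.0, sr / 2, n_bands + 1)  # log-spaced band edges
    bands = np.zeros((n_frames, n_bands))
    for b in range(n_bands):
        mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
        if mask.any():  # very narrow low bands may contain no FFT bin
            bands[:, b] = power[:, mask].mean(axis=1)
    return np.log1p(bands)  # compressive nonlinearity

def tokenize(cochleagram, n_levels=16):
    """Discretize each time-frequency cell into one of n_levels integer tokens."""
    lo, hi = cochleagram.min(), cochleagram.max()
    scaled = (cochleagram - lo) / (hi - lo + 1e-8)
    tokens = np.minimum((scaled * n_levels).astype(np.int64), n_levels - 1)
    return tokens.reshape(-1)  # time-major flattening into one sequence

def fit_bigram(tokens, vocab):
    """Stage 2 stand-in: next-token probabilities from add-one-smoothed counts."""
    counts = np.ones((vocab, vocab))
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1)
    return counts / counts.sum(axis=1, keepdims=True)

def next_token_nll(probs, tokens):
    """Mean negative log-likelihood of each token given the previous one."""
    return float(-np.log(probs[tokens[:-1], tokens[1:]]).mean())

def generate(probs, prompt, n_new, rng):
    """Sample a continuation one token at a time, conditioning on the last token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(int(rng.choice(len(probs), p=probs[out[-1]])))
    return np.array(out)

# Usage on a synthetic signal (a 440 Hz tone plus noise), not real speech.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(sr)

coch = toy_cochleagram(wave, sr=sr)
tokens = tokenize(coch, n_levels=16)
probs = fit_bigram(tokens, vocab=16)
nll = next_token_nll(probs, tokens)
continuation = generate(probs, tokens[:64], n_new=32, rng=rng)
print(coch.shape, tokens.shape, continuation.shape)
```

Because the tokens come from a time-frequency grid, a generated token sequence can be reshaped back into frames and bands and plotted, which is the kind of cochleagram-space inspection of predictions the abstract mentions. A bigram table conditions on one token only, so this sketch says nothing about the quality of the actual model.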

Supplementary Material: pdf

Primary Area: applications to neuroscience & cognitive science

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 11854
