NecoBERT: Self-Supervised Learning Model Trained by Masked Language Modeling on Rich Acoustic Features Derived from Neural Audio Codec

Wataru Nakata, Takaaki Saeki, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

Published: 2024, Last Modified: 18 Mar 2026, Venue: APSIPA 2024, License: CC BY-SA 4.0

Abstract: We propose NecoBERT, a self-supervised learning (SSL) model that extracts speech features containing both semantic and acoustic information. Representations learned without supervision from massive speech data, such as SSL and neural audio codec (NAC) features, are essential for developing speech foundation models. However, because conventional methods typically capture only one aspect of speech, i.e., either semantic or acoustic information, the learned features are not always suitable for complex downstream tasks. NecoBERT addresses this issue by applying masked language modeling, similar to HuBERT, to rich acoustic features derived from a NAC, enabling the acoustic features to represent semantic information as well. We evaluate NecoBERT on recognition tasks taken from the SUPERB benchmark and on speech resynthesis tasks. The results show that although NecoBERT does not outperform HuBERT on some recognition tasks, it achieves superior performance in speaker verification and speech resynthesis.
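
As a rough illustration of the HuBERT-style masked prediction objective mentioned in the abstract, the sketch below masks spans of frames and computes cross-entropy only at the masked positions. It is not the authors' implementation: the abstract does not specify how the NAC features are tokenized or fed to the model, so the per-frame discrete target IDs, the function names `sample_span_mask` and `masked_cross_entropy`, and the default masking parameters are all assumptions made for illustration.

```python
import math
import random


def sample_span_mask(num_frames, mask_prob=0.08, span=10, rng=None):
    """Choose each frame as a span start with probability mask_prob,
    then mask `span` consecutive frames from every chosen start.
    (mask_prob and span are illustrative defaults, not values from the paper.)"""
    rng = rng or random.Random(0)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span, num_frames)):
                mask[t] = True
    return mask


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked frames only.

    logits  : list of per-frame score lists, one score per target class
    targets : list of per-frame integer class IDs (assumed here to be
              discrete tokens derived from a neural audio codec)
    mask    : list of booleans, True where the input frame was masked
    """
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute to the loss
        m = max(frame_logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in frame_logits))
        total += log_z - frame_logits[target]
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    num_frames, vocab = 100, 8  # toy sizes, chosen arbitrarily
    targets = [rng.randrange(vocab) for _ in range(num_frames)]
    logits = [[rng.gauss(0.0, 1.0) for _ in range(vocab)] for _ in range(num_frames)]
    mask = sample_span_mask(num_frames, rng=rng)
    print("masked frames:", sum(mask))
    print("loss:", masked_cross_entropy(logits, targets, mask))
```

In a real system the logits would come from a Transformer encoder that sees the masked input; here they are random numbers, so the sketch only demonstrates how the loss is restricted to masked frames.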

External IDs: dblp:conf/apsipa/NakataSSTS24