Abstract: Self-supervised learning (SSL) has been shown to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification, and diarization. However, most current speech SSL approaches are computationally expensive. In this paper, we introduce a simplified and more efficient SSL framework, termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that NEST improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, including speech recognition/translation, speaker diarization, and spoken language understanding. Code and checkpoints are publicly available via the NVIDIA NeMo framework.
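To make the quantization choice concrete, below is a minimal PyTorch sketch of a fixed random-projection quantizer in the style of BEST-RQ: a frozen random matrix projects each feature frame, and the nearest entry in a frozen random codebook supplies the discrete training target. The class name, dimensions, and codebook size here are illustrative assumptions, not the NEST implementation.

```python
import torch

class RandomProjectionQuantizer(torch.nn.Module):
    """Sketch of fixed random-projection quantization (hypothetical parameters)."""

    def __init__(self, feat_dim: int = 80, proj_dim: int = 16, codebook_size: int = 8192):
        super().__init__()
        # Both tensors are drawn once at initialization and never updated,
        # which is what makes this quantizer training-free.
        self.register_buffer("proj", torch.empty(feat_dim, proj_dim).normal_())
        codebook = torch.empty(codebook_size, proj_dim).normal_()
        self.register_buffer(
            "codebook", torch.nn.functional.normalize(codebook, dim=-1)
        )

    @torch.no_grad()
    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) -> target token ids: (batch, time)
        proj = torch.nn.functional.normalize(feats @ self.proj, dim=-1)
        # Nearest codebook entry by Euclidean distance (equivalent to
        # maximum cosine similarity, since all vectors are unit-normalized).
        codes = self.codebook.unsqueeze(0).expand(proj.size(0), -1, -1)
        return torch.cdist(proj, codes).argmin(dim=-1)
```

Because nothing in the quantizer is learned, it avoids the iterative clustering passes (e.g., k-means over intermediate features) that make HuBERT-style target generation costly.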