Abstract: Self-supervised pretraining has transformed speech representation learning, enabling models to generalize across various downstream tasks. However, empirical studies have highlighted two notable gaps. First, different speech tasks require varying levels of acoustic and semantic information, which are encoded at different layers within the model. This adds the extra complexity of layer selection for downstream tasks to reach optimal performance. Second, the entanglement of acoustic and semantic information can undermine model robustness, particularly in varied acoustic environments. To address these issues, we propose a two-branch multitask finetuning strategy that integrates Automatic Speech Recognition (ASR) and transcript-aligned audio reconstruction, designed to preserve and disentangle semantic and acoustic information in the final layer of a pretrained model. Experiments with the pretrained Wav2Vec 2.0 model demonstrate that our approach surpasses ASR-only finetuning across multiple downstream tasks, and it significantly improves ASR robustness on acoustically varied (emotional) speech.
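The abstract describes a two-branch objective: an ASR branch and a transcript-aligned audio-reconstruction branch, both applied to the final-layer features of a pretrained encoder. The sketch below illustrates one plausible form of such a setup in PyTorch; it is an assumption, not the paper's exact configuration. The encoder interface, head sizes, the choice of mel-spectrogram reconstruction targets, the L1 reconstruction loss, and the loss weight `recon_weight` are all illustrative.

```python
# Minimal sketch of a two-branch multitask finetuning objective.
# Assumptions (not from the paper): the pretrained encoder (e.g. Wav2Vec 2.0)
# returns final-layer features of shape (B, T, H); reconstruction targets are
# frame-aligned mel features; losses are CTC + L1 with a fixed weight.
import torch
import torch.nn as nn

class TwoBranchFinetuner(nn.Module):
    def __init__(self, encoder, hidden_dim=768, vocab_size=32, n_mels=80, recon_weight=0.5):
        super().__init__()
        self.encoder = encoder                              # pretrained speech encoder
        self.asr_head = nn.Linear(hidden_dim, vocab_size)   # ASR (CTC) branch
        self.recon_head = nn.Linear(hidden_dim, n_mels)     # audio-reconstruction branch
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.l1 = nn.L1Loss()
        self.recon_weight = recon_weight

    def forward(self, waveforms, targets, input_lengths, target_lengths, mel_targets):
        feats = self.encoder(waveforms)                     # (B, T, H) final-layer features
        # ASR branch: frame-level log-probabilities scored with CTC.
        log_probs = self.asr_head(feats).log_softmax(-1)    # (B, T, V)
        asr_loss = self.ctc(log_probs.transpose(0, 1),      # CTC expects (T, B, V)
                            targets, input_lengths, target_lengths)
        # Reconstruction branch: predict acoustic features from the same layer,
        # encouraging it to retain acoustic detail alongside semantic content.
        recon_loss = self.l1(self.recon_head(feats), mel_targets)
        return asr_loss + self.recon_weight * recon_loss
```

The single shared feature tensor `feats` is the point of the design: because both losses back-propagate through the same final layer, that layer is pushed to keep acoustic information that ASR-only finetuning would otherwise discard.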