Abstract: Modern speech applications require compact embeddings that generalize across both linguistic and paralinguistic tasks. However, most existing embeddings are task-specific and fail to transfer effectively across domains. We propose wavCSE, a feature-based multi-task learning model that produces a fixed-size unified speech embedding suitable for both linguistic and paralinguistic tasks. wavCSE is jointly trained on keyword spotting, speaker identification, and emotion recognition, achieving state-of-the-art performance on all three tasks. The resulting unified embedding is then evaluated on twelve downstream tasks spanning both linguistic and paralinguistic domains. Experimental results show that it outperforms strong baselines on nine of the twelve tasks, indicating effective generalization across domains. To streamline embedding generation, we introduce a recursive layer selection strategy to identify the most informative hidden layer outputs from the upstream model and refine how these selected outputs are aggregated in the downstream model. These enhancements reduce memory usage and computational cost while improving task performance, making them broadly applicable to self-supervised learning-based speech processing models.
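The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: mixing a few selected upstream hidden-layer outputs, pooling them into one fixed-size embedding, and sharing that embedding across task heads. Everything here is an assumption made for illustration, not the authors' method: the softmax-weighted sum, the mean pooling over time, the linear heads, the layer indices, the tensor sizes, the class counts, and the names `unified_embedding` and `task_logits`. The recursive layer selection step itself is not sketched because the abstract does not describe how it works.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def unified_embedding(hidden_states, selected_layers, layer_logits):
    """Hypothetical aggregation: weighted sum of selected layers, then mean pooling.

    hidden_states: array of shape (num_layers, num_frames, dim)
    selected_layers: list of layer indices to keep (assumed given)
    layer_logits: one unnormalized weight per selected layer
    """
    weights = softmax(layer_logits)                # (k,)
    chosen = hidden_states[selected_layers]        # (k, num_frames, dim)
    mixed = np.tensordot(weights, chosen, axes=1)  # (num_frames, dim)
    return mixed.mean(axis=0)                      # fixed-size vector of shape (dim,)

def task_logits(embedding, heads):
    # One linear head per task, all reading the same shared embedding.
    return {name: embedding @ w for name, w in heads.items()}

# Random stand-ins for upstream outputs; sizes are illustrative assumptions.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(13, 50, 768))
selected_layers = [3, 7, 11]                       # hypothetical selection
layer_logits = np.zeros(len(selected_layers))      # uniform weights before training

embedding = unified_embedding(hidden_states, selected_layers, layer_logits)

# Class counts below are placeholders, not figures from the paper.
heads = {
    "keyword_spotting": rng.normal(size=(768, 12)),
    "speaker_identification": rng.normal(size=(768, 10)),
    "emotion_recognition": rng.normal(size=(768, 4)),
}
outputs = task_logits(embedding, heads)
```

In this sketch the embedding size depends only on the upstream feature dimension, not on the utterance length or on how many layers are kept, which is what makes dropping uninformative layers a direct saving in memory and compute.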
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: speech recognition, multi-task learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: en, bn, de, el, es, fa, fr, gu, hi, it, kn, ml, mr, or, pa, ru, sa, ta, te, ur, zh
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 2, 3
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: 3
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 3
C3 Descriptive Statistics: Yes
C3 Elaboration: 4
C4 Parameters For Packages: Yes
C4 Elaboration: 2, 3
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 688