Keywords: Contrastive learning, Domain generalization, Speech Synthesis, Diffusion Probabilistic Models
Abstract: Learning speech representations that generalize to unseen samples from different domains has been a challenge of ever-increasing importance. Although contrastive learning is a prominent class of representation learning approaches, state-of-the-art (SOTA) contrastive learning methods have shown limited ability to learn unseen, out-of-domain speech representations. This paper presents SynCLR, a synthesis framework for contrastive learning of speech representations that generalizes over unseen domains. Specifically, instead of using a data augmentation approach, SynCLR employs data synthesis for multi-view generation. To ensure a highly varied conditional speech distribution in view generation, we design a novel diffusion-based speech synthesizer. A new contrastive loss is also proposed to construct multiple embedding spaces, each of which preserves view-sensitive information, reducing domain reliance for better disentanglement. Our experiments showed that SynCLR outperformed SOTA contrastive learning methods with a 17.2\% relative reduction of EER in speaker verification tested on an unseen speech corpus, and achieved a 50.8\% relative reduction in FID on a challenging speech-to-image translation task given out-of-domain test speech.
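The abstract describes a contrastive loss over synthesized views. As background, a minimal sketch of a generic InfoNCE-style contrastive loss between two views of a batch is shown below; this is a standard baseline formulation for illustration only, not SynCLR's proposed multi-embedding-space loss (function name and temperature value are our own assumptions):

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE contrastive loss between two views of a batch.

    z1, z2: (N, D) embedding matrices; row i of z1 and row i of z2 are
    a positive pair (two views of the same sample). Illustrative only --
    SynCLR's actual view-sensitive multi-space loss is not reproduced here.
    """
    # L2-normalize so dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N) cross-view similarities
    # Positive pairs sit on the diagonal; apply cross-entropy against them.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Well-aligned views yield a lower loss than mismatched ones, which is the signal a contrastive learner optimizes.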
One-sentence Summary: We propose SynCLR, a synthesis framework for contrastive learning of speech representations that generalizes over unseen domains.