Keywords: Speech Processing, Efficiency, Multimodal, Multilingual, Embedding LLMs
Abstract: Speakers of under-represented languages face both a language barrier, as most online knowledge is in a few dominant languages, and a modality barrier, since information is largely text-based while many languages are primarily oral. We address this for French-Wolof by training the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on a costly ASR-translation pipeline. We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best. Although trained only for retrieval, the model generalizes well to other tasks, such as speech intent detection, indicating that it learns general semantic representations. Finally, we analyze cost-accuracy trade-offs across Matryoshka dimensions and ranks, showing that information is concentrated in only a few components, which suggests room for further efficiency gains.
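The cost-accuracy trade-off mentioned above can be illustrated with a minimal sketch of Matryoshka-style retrieval: embeddings trained this way remain usable after truncation to a prefix of their dimensions, so retrieval cost can be reduced by keeping only the first components. All names, dimensions, and the toy random vectors below are illustrative assumptions, not the paper's actual model or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a shared 256-d speech-text embedding space:
# 8 "French passage" vectors and one "Wolof query" vector close to passage 3.
full_dim = 256
passages = rng.normal(size=(8, full_dim))
query = passages[3] + 0.1 * rng.normal(size=full_dim)

def retrieve(query, passages, dim):
    """Truncate embeddings to the first `dim` Matryoshka components,
    re-normalize, and return the index of the top passage by cosine similarity."""
    q = query[:dim] / np.linalg.norm(query[:dim])
    p = passages[:, :dim] / np.linalg.norm(passages[:, :dim], axis=1, keepdims=True)
    return int(np.argmax(p @ q))

# Retrieval at full dimension vs. a small truncated prefix: with
# Matryoshka training, the prefix often preserves most of the accuracy
# at a fraction of the storage and similarity-computation cost.
print(retrieve(query, passages, 256))
print(retrieve(query, passages, 32))
```

In this toy setting both calls recover the same nearest passage; the paper's analysis measures how far real truncation can go before accuracy degrades.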
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: speech technologies, spoken language understanding, QA via spoken queries
Contribution Types: NLP engineering experiment, Approaches for low-compute settings / efficiency, Data resources
Languages Studied: Wolof, French
Submission Number: 6561