Advancing Speech In-Context Learning with Semantic and Acoustic Retrieval
Keywords: In-context learning, automatic speech recognition, large multimodal models
TL;DR: We present two retrieval-based methods that improve Speech In-Context Learning for ASR by selecting semantically and acoustically relevant examples, achieving up to 84.7% relative WER reduction without any model fine-tuning.
Abstract: Recent developments in large multimodal models have opened the door to Speech In-Context Learning (SICL) [2–5, 10], enabling adaptation to diverse speech recognition tasks without fine-tuning. However, the effectiveness of SICL depends on the selection of relevant in-context examples that align with the semantic content and acoustic characteristics of the target utterance [1, 6–9]. We address this challenge through two complementary studies that improve SICL example construction for automatic speech recognition (ASR).
We first introduce Text-Embedding KNN for SICL (TICL) [8], a retrieval framework that leverages semantic similarity in a text embedding space to identify high-quality in-context examples for large multimodal models. TICL generates a pseudo-transcription of the test utterance with a pretrained ASR model, encodes it into a lexical embedding space, and retrieves its nearest neighbors from a labeled candidate pool. Across accented English, multilingual speech, and children's speech benchmarks, TICL achieves substantial relative Word Error Rate (WER) reductions, up to 84.7% over zero-shot baselines, offering robust and generalizable gains without any model fine-tuning. Ablation studies reveal that performance gains saturate rapidly, with as few as four in-context examples sufficient for near-optimal results, and that semantic retrieval consistently outperforms speaker-identity and acoustic alternatives.
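To make the retrieval pipeline concrete, here is a minimal Python sketch of TICL-style example selection. It is illustrative rather than the authors' implementation: asr_transcribe and embed_text are hypothetical placeholders for any pretrained ASR model and any sentence-level text encoder, and the labeled pool is held as parallel lists of audio and transcripts.

    import numpy as np

    def ticl_retrieve(test_audio, pool_audio, pool_texts,
                      asr_transcribe, embed_text, k=4):
        """Select k in-context (audio, transcript) pairs by semantic similarity.

        asr_transcribe: audio -> str            (placeholder for a pretrained ASR)
        embed_text:     list[str] -> (n, d) array (placeholder for a text encoder)
        """
        # 1. Pseudo-transcribe the unlabeled test utterance.
        pseudo = asr_transcribe(test_audio)
        # 2. Embed the pseudo-transcript and the labeled candidate transcripts.
        q = embed_text([pseudo])[0]      # shape (d,)
        cand = embed_text(pool_texts)    # shape (n, d)
        # 3. Cosine-similarity KNN in the text embedding space.
        sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(-sims)[:k]
        # The selected pairs are prepended to the prompt as in-context examples.
        return [(pool_audio[i], pool_texts[i]) for i in top]

The default k=4 in this sketch mirrors the ablation finding above that gains saturate with as few as four examples.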
Semantic retrieval alone, however, can fall short in acoustically variable domains such as children's speech, where pronunciation patterns deviate systematically from adult norms in a developmental-stage-dependent manner and labeled data remains scarce. Our second study introduces TICL+ [9], which augments TICL with an acoustic reranking stage designed to address this gap. After semantic retrieval, a frozen speech encoder computes acoustic distances between the candidates and the target utterance, reranking the semantically similar candidates so that acoustically similar examples are prioritized. Across four diverse children's speech corpora, TICL+ consistently outperforms both zero-shot and baseline TICL methods, achieving up to 53.3% relative WER reduction versus zero-shot and up to 37.6% over TICL, demonstrating the complementary value of combining semantic and acoustic signals in low-resource, high-variability settings.
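As a companion sketch of the reranking stage (again an assumption-laden illustration, not the paper's code), one can over-retrieve m > k candidates with ticl_retrieve above and keep the k that are acoustically closest to the target; embed_speech is a hypothetical stand-in for a frozen speech encoder, e.g. a mean-pooled self-supervised representation.

    import numpy as np

    def ticl_plus_select(test_audio, semantic_candidates, embed_speech, k=4):
        """Rerank semantically retrieved (audio, text) pairs by acoustic distance.

        semantic_candidates: list of (audio, transcript) pairs, e.g. the top-m
            (m > k) output of ticl_retrieve above.
        embed_speech: audio -> (d,) array (placeholder for a frozen speech encoder).
        """
        target = embed_speech(test_audio)
        # Euclidean distance between candidate and target utterance embeddings.
        def acoustic_dist(pair):
            audio, _text = pair
            return float(np.linalg.norm(embed_speech(audio) - target))
        # Keep the k semantically relevant candidates that are also
        # acoustically closest to the target.
        return sorted(semantic_candidates, key=acoustic_dist)[:k]

Euclidean distance over utterance-level embeddings is one simple choice here; any distance in the frozen acoustic space would slot into the same reranking step.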
Together, these studies demonstrate that careful retrieval of in-context examples substantially improves SICL performance across various domains. By progressively integrating semantic and acoustic cues into a lightweight, modular pipeline, our work offers a scalable alternative to fine-tuning and a step toward more flexible adaptation of large multimodal models for robust speech recognition.
[1] Rishabh Agarwal, et al. Many-shot in-context learning. In Advances in Neural Information Processing Systems, volume 37, pages 76930–76966. Curran Associates, Inc., 2024.
[2] William Chen, et al. OWLS: Scaling laws for multilingual speech recognition and translation models. arXiv preprint arXiv:2502.10373, 2025.
[3] Zhehuai Chen, et al. SALM: Speech-augmented language model with in-context learning for speech recognition and translation. In ICASSP 2024, pages 13521–13525. IEEE, 2024.
[4] Omnilingual ASR Team, Gil Keren, et al. Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages. arXiv preprint arXiv:2511.09690, 2025.
[5] Siyin Wang, et al. Can Whisper perform speech-based in-context learning? In ICASSP 2024, pages 13421–13425. IEEE, 2024.
[6] Zhao Yang, et al. Representative demonstration selection for in-context learning with two-stage determinantal point process. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of EMNLP 2023, pages 5443–5456, Singapore, December 2023. Association for Computational Linguistics.
[7] Zihao Zhao, et al. Calibrate before use: Improving few-shot performance of language models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706. PMLR, 18–24 Jul 2021.
[8] Haolong Zheng, Yekaterina Yegorova, and Mark Hasegawa-Johnson. TICL: Text-embedding KNN for speech in-context learning unlocks speech recognition abilities of large multimodal models. arXiv preprint arXiv:2509.13395, 2025.
[9] Haolong Zheng, Yekaterina Yegorova, and Mark Hasegawa-Johnson. TICL+: A case study on speech in-context learning for children's speech recognition. arXiv preprint arXiv:2512.18263, 2025.
[10] Jiaming Zhou, et al. M2R-Whisper: Multi-stage and multi-scale retrieval augmentation for enhancing Whisper. In ICASSP 2025. IEEE, 2025.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 54