Advancing Speech In-Context Learning with Semantic and Acoustic Retrieval
Keywords: In-context learning, automatic speech recognition, large multimodal models
TL;DR: We present two retrieval-based methods that improve Speech In-Context Learning for ASR by selecting semantically and acoustically relevant examples, achieving up to 84.7% relative WER reduction without any model fine-tuning.
Abstract: Recent developments in large multimodal models have opened the door to Speech In-Context Learning (SICL) [2–5, 10], enabling adaptation to diverse speech recognition tasks without fine-tuning. However, the effectiveness of SICL depends on the selection of relevant in-context examples that align with the semantic content and acoustic characteristics of the target utterance [1, 6–9]. We address this challenge through two complementary studies that improve SICL example construction for automatic speech recognition (ASR).
We first introduce Text-Embedding KNN for SICL (TICL) [8], a retrieval framework that leverages semantic similarity in a text embedding space to identify high-quality in-context examples for large multimodal models. TICL generates a pseudo-transcription of the test utterance with a pretrained ASR model, encodes it into a lexical embedding space, and retrieves its nearest neighbors from a labeled candidate pool. Across accented English, multilingual speech, and children's speech benchmarks, TICL achieves substantial relative Word Error Rate (WER) reductions, up to 84.7% over zero-shot baselines, offering robust and generalizable gains without any model fine-tuning. Ablation studies reveal that performance gains saturate rapidly, with as few as four in-context examples sufficient for near-optimal results, and that semantic retrieval consistently outperforms speaker-identity and acoustic alternatives.
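To make the retrieval pipeline concrete, here is a minimal Python sketch of TICL-style example selection. It is illustrative rather than the authors' implementation: asr_transcribe and embed_text are hypothetical placeholders for any pretrained ASR model and any sentence-level text encoder, and the labeled pool is held as parallel lists of audio and transcripts.

    import numpy as np

    def ticl_retrieve(test_audio, pool_audio, pool_texts,
                      asr_transcribe, embed_text, k=4):
        """Select k in-context (audio, transcript) pairs by semantic similarity.

        asr_transcribe: audio -> str            (placeholder for a pretrained ASR)
        embed_text:     list[str] -> (n, d) array (placeholder for a text encoder)
        """
        # 1. Pseudo-transcribe the unlabeled test utterance.
        pseudo = asr_transcribe(test_audio)
        # 2. Embed the pseudo-transcript and the labeled candidate transcripts.
        q = embed_text([pseudo])[0]      # shape (d,)
        cand = embed_text(pool_texts)    # shape (n, d)
        # 3. Cosine-similarity KNN in the text embedding space.
        sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(-sims)[:k]
        # The selected pairs are prepended to the prompt as in-context examples.
        return [(pool_audio[i], pool_texts[i]) for i in top]

The default k=4 in this sketch mirrors the ablation finding above that gains saturate with as few as four examples.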
Semantic retrieval alone, however, can fall short in acoustically variable domains such as children's speech, where pronunciation patterns deviate systematically from adult norms in a developmental-stage-dependent manner and labeled data remains scarce. Our second study introduces TICL+ [9], which augments TICL with an acoustic reranking stage designed to address this gap. After semantic retrieval, a frozen speech encoder computes acoustic distances between the candidates and the target utterance, reranking the semantically similar candidates so that acoustically similar examples are prioritized. Across four diverse children's speech corpora, TICL+ consistently outperforms both zero-shot and baseline TICL methods, achieving up to 53.3% relative WER reduction versus zero-shot and up to 37.6% over TICL, demonstrating the complementary value of combining semantic and acoustic signals in low-resource, high-variability settings.
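As a companion sketch of the reranking stage (again an assumption-laden illustration, not the paper's code), one can over-retrieve m > k candidates with ticl_retrieve above and keep the k that are acoustically closest to the target; embed_speech is a hypothetical stand-in for a frozen speech encoder, e.g. a mean-pooled self-supervised representation.

    import numpy as np

    def ticl_plus_select(test_audio, semantic_candidates, embed_speech, k=4):
        """Rerank semantically retrieved (audio, text) pairs by acoustic distance.

        semantic_candidates: list of (audio, transcript) pairs, e.g. the top-m
            (m > k) output of ticl_retrieve above.
        embed_speech: audio -> (d,) array (placeholder for a frozen speech encoder).
        """
        target = embed_speech(test_audio)
        # Euclidean distance between candidate and target utterance embeddings.
        def acoustic_dist(pair):
            audio, _text = pair
            return float(np.linalg.norm(embed_speech(audio) - target))
        # Keep the k semantically relevant candidates that are also
        # acoustically closest to the target.
        return sorted(semantic_candidates, key=acoustic_dist)[:k]

Euclidean distance over utterance-level embeddings is one simple choice here; any distance in the frozen acoustic space would slot into the same reranking step.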
Together, these studies demonstrate that careful retrieval of in-context examples substantially improves SICL performance across various domains. By progressively integrating semantic and acoustic cues into a lightweight, modular pipeline, our work offers a scalable alternative to fine-tuning and a step toward more flexible adaptation of large multimodal models for robust speech recognition.
[1] Rishabh Agarwal, et al. Many-shot in-context learning. In Advances in Neural Information Processing Systems, volume 37, pages 76930–76966. Curran Associates, Inc., 2024.
[2] William Chen, et al. OWLS: Scaling laws for multilingual speech recognition and translation models. arXiv preprint arXiv:2502.10373, 2025.
[3] Zhehuai Chen, et al. SALM: Speech-augmented language model with in-context learning for speech recognition and translation. In ICASSP 2024, pages 13521–13525. IEEE, 2024.
[4] Omnilingual ASR Team, Gil Keren, et al. Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages. arXiv preprint arXiv:2511.09690, 2025.
[5] Siyin Wang, et al. Can Whisper perform speech-based in-context learning? In ICASSP 2024, pages 13421–13425. IEEE, 2024.
[6] Zhao Yang, et al. Representative demonstration selection for in-context learning with two-stage determinantal point process. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of EMNLP 2023, pages 5443–5456, Singapore, December 2023. Association for Computational Linguistics.
[7] Zihao Zhao, et al. Calibrate before use: Improving few-shot performance of language models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706. PMLR, 18–24 Jul 2021.
[8] Haolong Zheng, Yekaterina Yegorova, and Mark Hasegawa-Johnson. TICL: Text-embedding KNN for speech in-context learning unlocks speech recognition abilities of large multimodal models. arXiv preprint arXiv:2509.13395, 2025.
[9] Haolong Zheng, Yekaterina Yegorova, and Mark Hasegawa-Johnson. TICL+: A case study on speech in-context learning for children's speech recognition. arXiv preprint arXiv:2512.18263, 2025.
[10] Jiaming Zhou, et al. M2R-Whisper: Multi-stage and multi-scale retrieval augmentation for enhancing Whisper. In ICASSP 2025. IEEE, 2025.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 54