Accelerating Pre-Trained Speech Foundation Model Deployment Using Randomly Recursive Transformers

20 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: speech foundation models, knowledge distillation, model compression
TL;DR: This paper studies layer sharing for speech foundation model compression. In particular, we introduce Randomly Recursive Transformers and a corresponding training method for low-resource environments.
Abstract: Compressing pre-trained speech foundation models has been studied to address the high computational cost of large-scale models. Knowledge distillation is a widely applied technique for this goal, reducing the width or depth of the Transformer architecture. However, restricting the number of parameters reduces model capability, which significantly limits performance on complex speech processing tasks such as automatic speech recognition and phoneme recognition. In this study, we explore a layer sharing method for speech foundation model distillation in which layers are recursively shared across the Transformer stack, thereby reducing parameters while preserving performance. Furthermore, we introduce Randomly Recursive Transformers, which are distilled with random recursion counts. Because of this randomness, a single distilled Randomly Recursive Transformer can be fine-tuned into models of various depths by varying the recursion, unlike previous distillation methods that delay model deployment by requiring separate architecture-wise training for each resource requirement. To train Randomly Recursive Transformers, we propose a practical low-resource training method, stochastic batch advancing, which trains a random-recursion model under limited computation. We experimentally verify the efficacy of layer recursion on various speech processing tasks from SUPERB, achieving significant performance improvements. We also demonstrate that our method can fine-tune multiple automatic speech recognition models with different recursion depths from a single distillation process.
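
To make the layer-recursion idea concrete, the sketch below shows a minimal PyTorch encoder in which a small shared stack of Transformer layers is applied a random number of times per training step, so one set of parameters serves several effective depths. This is not the authors' code: the class name RandomlyRecursiveEncoder, the parameter max_recursions, and all hyperparameters are our own illustrative choices, and the distillation loss against a teacher as well as the stochastic batch advancing procedure are omitted.

# Minimal sketch (assumed, not the authors' implementation) of a randomly
# recursive Transformer encoder: one small layer stack is reused `r` times,
# with `r` sampled randomly during training so a single distilled model can
# later be deployed or fine-tuned at multiple depths.
import random
import torch
import torch.nn as nn


class RandomlyRecursiveEncoder(nn.Module):
    """Hypothetical encoder that shares one layer stack across recursions."""

    def __init__(self, d_model=768, nhead=12, num_shared_layers=2, max_recursions=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=4 * d_model,
            batch_first=True,
        )
        # Only `num_shared_layers` unique layers are stored; extra depth
        # comes from reusing this stack, not from new parameters.
        self.shared_stack = nn.TransformerEncoder(layer, num_layers=num_shared_layers)
        self.max_recursions = max_recursions

    def forward(self, x, num_recursions=None):
        # During distillation, sample the recursion count at random so the
        # model learns to operate at several effective depths at once.
        if num_recursions is None:
            num_recursions = (random.randint(1, self.max_recursions)
                              if self.training else self.max_recursions)
        for _ in range(num_recursions):
            x = self.shared_stack(x)
        return x


if __name__ == "__main__":
    student = RandomlyRecursiveEncoder()
    feats = torch.randn(2, 100, 768)  # (batch, frames, feature dim)
    shallow = student(feats, num_recursions=1)  # fast, low-depth deployment
    deep = student(feats, num_recursions=4)     # deeper, higher-capacity pass
    print(shallow.shape, deep.shape)

At deployment time, the recursion count becomes a knob: the same distilled weights can be run with fewer recursions on constrained hardware or more recursions when accuracy matters, which is the property the abstract attributes to training on random recursions.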
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2026/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 25547