Parameter-Efficient Cross-Language Transfer Learning for a Language-Modular Audiovisual Speech Recognition
Abstract: In audiovisual speech recognition (AV-ASR), only little audiovisual data is available for many languages. Building upon an English model, in this work we first apply and analyze various adapters for cross-language transfer learning to build a parameter-efficient and easy-to-extend AV-ASR model for multiple languages. Fine-tuning only the bottleneck adapter, with 4% of the encoder's parameters, together with the decoder shows performance comparable to full fine-tuning in French and Spanish AV-ASR. Second, we investigate the effectiveness of various encoder components in cross-language transfer learning. Our proposed modular linguistic transfer learning approach outperforms full fine-tuning for German, French, and Spanish AV-ASR in almost all clean and noisy conditions (8/9). On low-resource German AV data (13h), our proposed linguistic transfer learning achieves a 4.1% absolute WER reduction on average over clean and noisy speech, while fine-tuning only 50% of the encoder's parameters. Our code is available at https://github.com/ifnspaml/Cross_Language_Transfer_Learning_AVASR.git
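The parameter efficiency claimed above comes from the bottleneck adapter design: a small down-projection, a nonlinearity, and an up-projection wrapped in a residual connection, inserted into a frozen pretrained encoder. The following is a minimal NumPy sketch of this general mechanism, not the authors' implementation; all function and parameter names here are hypothetical, and the up-projection is typically zero-initialized so the adapted model starts out identical to the pretrained one.

```python
import numpy as np

def bottleneck_adapter(x, W_down, b_down, W_up, b_up):
    # Generic residual bottleneck adapter (hypothetical names, not the paper's code):
    # down-project to a small bottleneck, apply ReLU, up-project, add residual.
    h = np.maximum(x @ W_down + b_down, 0.0)  # (batch, d) -> (batch, m), m << d
    return x + h @ W_up + b_up                # (batch, m) -> (batch, d), residual add

def adapter_param_count(d, m):
    # Trainable parameters per adapter: two projection matrices plus biases.
    return d * m + m + m * d + d
```

With hidden size d and bottleneck size m much smaller than d, each adapter adds roughly 2·d·m parameters per layer, which is how only a few percent of the encoder's parameters need to be fine-tuned. Zero-initializing W_up and b_up makes the adapter an identity function at the start of training.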