Demystifying Representation Alignment in Multilingual and Multimodal Aspects of Large Audio Language Models
Abstract: The mechanistic understanding of large language models (LLMs) has facilitated advancements in controllable generation, knowledge editing, model stitching, and other foundational techniques. However, the behavior of LLMs in multimodal and multilingual contexts remains largely unexplored, despite their increasing complexity. This paper investigates how large audio language models (LALMs) process and represent language, modality, and speaker demographics. Through a series of experiments, we analyze the latent representations extracted from diverse input cases using two state-of-the-art open-weight LALMs: Ultravox 0.5 and Qwen2 Audio. Our study examines patterns in these representations to uncover the processing mechanisms of LALMs across seven languages and two modalities (text and speech). Additionally, we explore paralinguistic speech features such as gender, age, and accent, as well as acoustic features arising from variations in the recording setup. By bridging this gap in the understanding of LALMs, this work provides insights into their behavior and lays the groundwork for future research in this critical area.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Mechanistic, LALM, Multimodal, Multilingual
Contribution Types: Model analysis & interpretability
Languages Studied: English, French, German, Chinese, Japanese, Indonesian, Vietnamese
Submission Number: 4550