Demystifying Representation Alignment in Multilingual and Multimodal Aspects of Large Audio Language Models
Abstract: The mechanistic understanding of large language models (LLMs) has facilitated advancements in controllable generation, knowledge editing, model stitching, and other foundational techniques. However, the behavior of LLMs in multimodal and multilingual contexts remains largely unexplored, despite their increasing complexity. This paper investigates how large audio language models (LALMs) process and represent language, modality, and speaker demographics. Through a series of experiments, we analyze the latent representations extracted from diverse input cases using two state-of-the-art open-weight LALMs: Ultravox 0.5 and Qwen2 Audio. Our study examines patterns in these representations to uncover the processing mechanisms of LALMs across seven languages and two modalities (text and speech). Additionally, we explore paralinguistic speech features such as gender, age, and accent, as well as acoustic features arising from variations in the recording setup. By bridging this gap in the understanding of LALMs, this work provides insights into their behavior and lays the groundwork for future research in this critical area.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Mechanistic, LALM, Multimodal, Multilingual
Contribution Types: Model analysis & interpretability
Languages Studied: English, French, German, Chinese, Japanese, Indonesian, Vietnamese
Submission Number: 4550