What do MLLMs hear? Examining the interaction between LLM and audio encoder components in Multimodal Large Language Models

Published: 10 Oct 2024, Last Modified: 30 Oct 2024
Venue: Audio Imagination: NeurIPS 2024 Workshop
License: CC BY 4.0
Keywords: Multimodal Large Language Models, audio encoder, multimodal reasoning
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, notably in connecting ideas and adhering to logical rules to solve problems. These models have evolved to accommodate additional data modalities, including sound and images; such multimodal LLMs (MLLMs) can generate descriptions of images or sound recordings. We evaluate how MLLMs' separate representations of auditory and textual information may sever the reasoning pathway between the audio encoder and the LLM component. Through a captioning-based classification experiment with similar and hierarchical textual relationships, we demonstrate that audio MLLMs cannot fully leverage their LLMs' text-based reasoning when generating audio captions.
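As a rough illustration of the captioning-based classification setup described above, the sketch below captions an audio clip with an MLLM and then maps the caption onto a label set containing similar and hierarchically related classes via text similarity. This is a minimal sketch, not the paper's implementation: the `caption_audio` function, the example label list, and the use of sentence-transformers embeddings are all assumptions introduced here for illustration.

```python
from sentence_transformers import SentenceTransformer, util


def caption_audio(clip_path: str) -> str:
    """Hypothetical stand-in for an audio MLLM's captioning call.

    The paper does not specify a model or prompt here; plug in the
    audio MLLM under study via its own inference interface.
    """
    raise NotImplementedError("Replace with an audio MLLM captioning call.")


# Example label set with similar and hierarchical textual relationships
# (a parent category alongside closely related child classes).
labels = ["animal", "dog", "dog barking", "cat", "cat meowing"]

# Text embedding model used only to compare captions against labels.
text_model = SentenceTransformer("all-MiniLM-L6-v2")
label_embeddings = text_model.encode(labels, convert_to_tensor=True)


def classify_caption(caption: str) -> str:
    """Assign the generated caption to the closest label by cosine similarity."""
    caption_embedding = text_model.encode(caption, convert_to_tensor=True)
    scores = util.cos_sim(caption_embedding, label_embeddings)[0]
    return labels[int(scores.argmax())]


# Usage (assuming a local clip): classify_caption(caption_audio("barking.wav"))
```

If the LLM component's text-based reasoning were fully available to the audio pathway, captions for a barking clip should map consistently to both "dog barking" and its parent category "animal"; systematic failures on such hierarchical mappings are the kind of gap the experiment probes.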
Submission Number: 5