Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large language model; audio understanding; multi-task training; in-context training; audio-visual learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown their promise in solving a wide variety of vision and language understanding tasks through their massive open-world knowledge and inter-task homogeneity, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. Meanwhile, a majority of existing multimodal foundation models (e.g., VLMs) structure their input sequences as [Multimedia, Question, Answer], constraining their applicability to more comprehensive tasks, such as natural language visual reasoning. In this work, we introduce Acoustic Prompt Tuning (APT), an acoustic adapter that leverages a multi-task learning framework to extend LLMs and VLMs to the audio domain. APT uses an instruction-aware aligner to acquire a fixed number of acoustic embeddings by cross-attending to audio feature maps generated by an audio encoder. Diverse audio-related tasks are formulated in a sequence-to-sequence manner without imposing any constraints on input sequences, thereby allowing APT to be seamlessly trained by combining the present multi-task learning with in-context learning. Experiments show that LLMs coupled with APT (namely APT-LLMs) achieve results competitive with expert models (i.e., networks trained on the targeted datasets) across various tasks. Additionally, we evaluate APT-LLMs on a novel audio reasoning task involving comparative analysis and summarisation of two audio clips. We also demonstrate APT's ability to extend frozen VLMs to the audio domain, yielding promising results on the audio-visual understanding task even without being finetuned on any audio-visual datasets. Our code and model weights will be released.
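To make the described aligner concrete, the sketch below shows one plausible reading of the abstract: a fixed set of learnable query tokens cross-attends over audio feature maps from a frozen audio encoder, optionally conditioned on instruction-text embeddings, and returns a fixed number of acoustic embeddings for the LLM input. This is not the authors' released code; all module names, dimensions, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's implementation) of an instruction-aware aligner.
import torch
import torch.nn as nn


class InstructionAwareAligner(nn.Module):
    def __init__(self, audio_dim=768, llm_dim=4096, num_queries=32, num_heads=8, num_layers=2):
        super().__init__()
        # Fixed number of learnable acoustic query tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Project frame-level audio features into the query/LLM embedding space.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        # Stacked cross-attention blocks: queries attend to audio (+ instruction) tokens.
        self.blocks = nn.ModuleList([
            nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(llm_dim) for _ in range(num_layers)])

    def forward(self, audio_feats, instr_embeds=None):
        """
        audio_feats:  (B, T, audio_dim) feature maps from a frozen audio encoder.
        instr_embeds: (B, L, llm_dim) instruction-text embeddings (optional conditioning).
        returns:      (B, num_queries, llm_dim) acoustic embeddings to prepend to the LLM input.
        """
        B = audio_feats.size(0)
        keys = self.audio_proj(audio_feats)
        if instr_embeds is not None:
            # "Instruction-aware": the queries also attend to the instruction tokens.
            keys = torch.cat([keys, instr_embeds], dim=1)
        x = self.queries.unsqueeze(0).expand(B, -1, -1)
        for attn, norm in zip(self.blocks, self.norms):
            out, _ = attn(query=x, key=keys, value=keys)
            x = norm(x + out)  # residual connection + layer norm
        return x
```

Under this reading, the resulting fixed-length acoustic embeddings can be concatenated with text embeddings in any position of the input sequence, which is what frees the formulation from the rigid [Multimedia, Question, Answer] layout and enables the combined multi-task and in-context training described above.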
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7454