AudioAgent: Enhancing Task Performance through Modality-Driven Prompt Optimization

ACL ARR 2024 June Submission879 Authors

13 Jun 2024 (modified: 03 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have achieved remarkable progress in serving as controllers that interpret user instructions and select models for audio tasks. However, when selecting tools, current LLMs consider only the textual input, neglecting valuable information in the audio modality that could aid in choosing the appropriate tool. Because instructions can be ambiguous, selection errors are common. To this end, we introduce AudioAgent, a versatile and adaptable agent framework for the audio domain. It is the first system that emphasizes audio comprehension and uses this information to autonomously refine user-provided prompts with a finetuned LLM. Through clearer instructions, AudioAgent enables the controller to select the best tools precisely and improves task performance. Our framework also lets users freely register tools and use any LLM as the core controller. Both subjective and objective metrics validate the effectiveness of our work. Result samples are available at https://AudioAgentTool.github.io.
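To make the abstract's control flow concrete, here is a minimal, hypothetical Python sketch of the pipeline it describes: users register tools, a finetuned LLM refines the user prompt using information drawn from the audio itself, and a controller LLM then selects a tool from the refined instruction. All names (Tool, AudioAgent, refine_prompt, select_tool) are illustrative assumptions, not the authors' actual API, and both LLM stages are stubbed with placeholders.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Tool:
    name: str
    description: str           # read by the controller when matching instructions
    run: Callable[[str], str]  # takes an audio path, returns a result string

class AudioAgent:
    def __init__(self) -> None:
        self.tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        # Users can freely register tools, as the abstract states.
        self.tools[tool.name] = tool

    def refine_prompt(self, prompt: str, audio_path: str) -> str:
        # Stub for the finetuned LLM that rewrites an ambiguous user prompt
        # using information extracted from the audio modality.
        audio_summary = f"[audio features of {audio_path}]"
        return f"{prompt} (audio context: {audio_summary})"

    def select_tool(self, refined_prompt: str) -> Tool:
        # Stub for the controller LLM: a naive keyword match against tool
        # descriptions stands in for LLM-based tool selection here.
        lowered = refined_prompt.lower()
        for tool in self.tools.values():
            if any(word in lowered for word in tool.description.lower().split()):
                return tool
        raise ValueError("no registered tool matched the refined prompt")

    def run(self, prompt: str, audio_path: str) -> str:
        refined = self.refine_prompt(prompt, audio_path)
        return self.select_tool(refined).run(audio_path)

# Usage: register a transcription tool, then dispatch an instruction.
agent = AudioAgent()
agent.register(Tool("asr", "transcribe speech to text",
                    run=lambda path: f"transcript of {path}"))
print(agent.run("transcribe this recording", "clip.wav"))

In the actual system, refine_prompt and select_tool would each be backed by an LLM; the sketch only illustrates how prompt refinement sits between the raw instruction and tool selection.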
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Dialogue and Interactive Systems; Speech Recognition, Text-to-Speech and Spoken Language Understanding
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 879