Keywords: Audio Large Language Models, trustworthiness evaluation, dataset
Abstract: The rapid development and widespread adoption of Audio Large Language Models (ALLMs) demand a rigorous assessment of their trustworthiness. However, existing evaluation frameworks, designed primarily for text, are not equipped to handle the unique vulnerabilities introduced by audio’s acoustic properties. We find that significant trustworthiness risks in ALLMs arise from non-semantic acoustic cues, such as timbre, accent, and background noise, which can be used to manipulate model behavior. To address this gap, we propose AudioTrust, the first framework for large-scale, systematic evaluation of ALLM trustworthiness with respect to these audio-specific risks. AudioTrust spans six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. It is implemented through 26 distinct sub-tasks and a curated dataset of over 4,420 audio samples drawn from real-world scenarios (e.g., daily conversations, emergency calls, and voice assistant interactions), purposefully constructed to probe the trustworthiness of ALLMs across multiple dimensions. Our comprehensive evaluation covers 18 distinct experimental configurations and employs human-validated automated pipelines to quantify model outputs objectively and at scale. Experimental results reveal the boundaries and limitations of 14 state-of-the-art (SOTA) open-source and closed-source ALLMs when confronted with diverse high-risk audio scenarios, offering critical insights into the secure and trustworthy deployment of future audio models. Our platform and benchmark are publicly available at https://anonymous.4open.science/r/AudioTrust-8715/.
Submission Number: 73