LLM-Generated Black-box Explanations Can Be Adversarially Helpful

Published: 10 Oct 2024 · Last Modified: 03 Aug 2025 · RegML 2024 · CC BY 4.0
Keywords: Interpretable AI, Adversarial Helpfulness, Black-box Explanations, Persuasion Strategies, AI Trust and Reliability, Large Language Models, Natural Language Inference
TL;DR: This paper investigates how Large Language Models (LLMs) can generate misleading explanations that make incorrect answers appear correct, highlighting the need for improved safeguards in AI explainability.
Abstract: Large language models (LLMs) are becoming vital tools that help us solve and understand complex problems. LLMs can generate convincing explanations even when given only the inputs and outputs of a problem, i.e., in a "black-box" approach. However, our research uncovers a hidden risk tied to this approach, which we call $\textit{adversarial helpfulness}$. This occurs when an LLM's explanation makes a wrong answer look correct, potentially leading people to trust faulty solutions. In this paper, we show that this issue affects not just humans but also LLM evaluators. Digging deeper, we identify and examine the key persuasive strategies LLMs employ. Our findings reveal that these models use strategies such as reframing the question, expressing elevated confidence, and cherry-picking evidence that supports the incorrect answer. We further construct a symbolic graph reasoning task to analyze the mechanisms by which LLMs generate adversarially helpful explanations. Most LLMs are unable to find alternative paths in simple graphs, indicating that mechanisms other than logical deduction may drive adversarial helpfulness. These findings shed light on the limitations of black-box explanations and lead to recommendations for the safer use of LLMs.
Submission Number: 14
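To make the symbolic graph reasoning probe described in the abstract concrete, the sketch below is a rough, hypothetical illustration (not the paper's exact setup) of the kind of check involved: given a small directed graph and a claimed path between two nodes, enumerate whether any alternative simple path between the same endpoints actually exists. The toy graph, node names, and helper function are assumptions for illustration only.

```python
# Hypothetical sketch of a symbolic graph-reasoning probe: does an alternative
# simple path exist that could support a claimed (possibly incorrect) answer?
from typing import Dict, List

Graph = Dict[str, List[str]]

def simple_paths(graph: Graph, src: str, dst: str) -> List[List[str]]:
    """Enumerate all simple (cycle-free) paths from src to dst via iterative DFS."""
    paths, stack = [], [[src]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == dst:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep paths simple: no revisited nodes
                stack.append(path + [nxt])
    return paths

# Toy graph: A -> B -> D is the only real path from A to D.
toy_graph: Graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}

claimed = ["A", "C", "D"]  # a path an adversarially helpful explanation might defend
all_paths = simple_paths(toy_graph, "A", "D")

print("claimed path exists:", claimed in all_paths)          # False
print("alternative paths:", [p for p in all_paths if p != claimed])  # [['A', 'B', 'D']]
```

In this framing, a model that truly reasoned over the graph could verify that the claimed path does not exist; an adversarially helpful explanation would instead argue for it without such a check.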