Toward Explanations for Large Language Models in Natural Language

TMLR Paper5749 Authors

27 Aug 2025 (modified: 12 Sept 2025) · Under review for TMLR · License: CC BY 4.0
Abstract: Large Language Models (LLMs) have become proficient at addressing complex tasks by leveraging their extensive internal knowledge and reasoning capabilities. However, the black-box nature of these models complicates explaining their decision-making processes. While recent work demonstrates the potential of prompting LLMs to self-explain their predictions through natural language (NL) explanations, these explanations or chains of thought may not accurately reflect the LLMs' decision-making process, since they need not involve the true decision-making pivots. Measuring the fidelity of NL explanations is challenging, because it is difficult to manipulate the input context to mask the semantics of an explanation, yet it is important, as fidelity effectively assesses explanation quality. To this end, we introduce FaithLM, a method for explaining LLM decisions with NL explanations. Specifically, FaithLM evaluates the fidelity of NL explanations by incorporating contrary explanations into the query process. Moreover, FaithLM applies an iterative process to improve the fidelity of the derived explanations. Experimental results on three datasets from multiple domains demonstrate that FaithLM significantly improves the fidelity of the derived explanations, which also align better with ground-truth explanations.
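The abstract outlines two mechanisms: a fidelity check based on contrary explanations and an iterative process that improves fidelity. The sketch below illustrates one way these pieces could fit together; it is a minimal reading of the abstract, not the paper's implementation, and the prompts, the scoring rule in `fidelity`, and the names (`contrary_explanation`, `refine_explanation`, the `llm` callable) are all assumptions introduced here for illustration.

```python
from typing import Callable

# `llm` is any callable mapping a prompt string to the model's text reply;
# plug in your own client (e.g. a thin wrapper around a chat-completions API).
LLM = Callable[[str], str]


def contrary_explanation(llm: LLM, explanation: str) -> str:
    """Ask the model for an explanation with the opposite meaning."""
    return llm(f"Rewrite the following explanation so it states the opposite:\n{explanation}")


def fidelity(llm: LLM, question: str, prediction: str, explanation: str) -> float:
    """Score in [0, 1]: high when querying with the original explanation keeps
    the prediction while querying with its contrary version flips it."""
    contrary = contrary_explanation(llm, explanation)
    with_expl = llm(f"{question}\nConsider this hint: {explanation}\nAnswer:")
    with_contrary = llm(f"{question}\nConsider this hint: {contrary}\nAnswer:")
    kept = float(prediction.strip().lower() in with_expl.strip().lower())
    flipped = float(prediction.strip().lower() not in with_contrary.strip().lower())
    return 0.5 * (kept + flipped)


def refine_explanation(llm: LLM, question: str, prediction: str, n_iter: int = 3) -> str:
    """Iteratively regenerate the explanation, keeping the most faithful one."""
    explanation = llm(f"Explain briefly why the answer to '{question}' is '{prediction}'.")
    best, best_score = explanation, fidelity(llm, question, prediction, explanation)
    for _ in range(n_iter):
        explanation = llm(
            f"The explanation below may not reflect the real reason the answer is "
            f"'{prediction}'. Revise it to be more faithful:\n{explanation}"
        )
        score = fidelity(llm, question, prediction, explanation)
        if score > best_score:
            best, best_score = explanation, score
    return best
```

In this reading, an explanation counts as faithful when conditioning the query on it preserves the model's original prediction while conditioning on its contrary version changes the answer; the refinement loop simply keeps whichever candidate explanation scores highest under that check.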
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=JGdfTdEH3g
Assigned Action Editor: ~Shahin_Jabbari1
Submission Number: 5749