Keywords: Trustworthy Machine Learning, Explainability, Interpretability, Faithfulness, Large Language Models
TL;DR: We explore approaches to improve the faithfulness of CoT reasoning generated by large language models, and present their shortcomings.
Abstract: As Large Language Models (LLMs) are increasingly employed in critical domains such as healthcare, it is essential to make these models trustworthy. In this pursuit, Chain-of-Thought (CoT) prompting has emerged as a potential source of transparency in LLMs. While CoT reasoning is appealing to humans, prior studies have shown that these reasoning chains are not faithful, i.e., they do not accurately reflect the underlying LLM's behavior. Ensuring the faithfulness of LLM-generated CoT reasoning is crucial for decision-makers, who rely on it to determine if, when, and to what extent to trust the recommendations made by these models. While several works have proposed strategies to enhance accuracy and truthfulness in LLMs, there has been little exploration of the effectiveness of these common strategies for enhancing the faithfulness of CoT reasoning. Specifically, we explore the promise of in-context learning, fine-tuning, and activation editing for improving the faithfulness of CoT reasoning. Our empirical analyses on benchmark tasks indicate that these strategies offer limited success, with only slight performance enhancements in controlled scenarios. Activation editing demonstrated minimal success, while fine-tuning and in-context learning achieved marginal improvements that failed to generalize across reasoning and truthful question-answering benchmarks. We subsequently analyse what makes faithful CoT reasoning challenging, and present findings to lay the groundwork for future research in trustworthy reasoning from LLMs. In summary, our work underscores the inherent difficulty of eliciting faithful CoT reasoning from LLMs, suggesting that the current array of approaches may not be sufficient to address this challenge.
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12455