Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs
Abstract: Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3 and GPT-4 models exhibit low consistency rates of both types across a variety of tasks.
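The two consistency checks described in the abstract can be sketched in code. This is a minimal illustration, not the paper's implementation: `query_model` is a hypothetical stand-in for any LLM completion call, here replaced by a deterministic toy lookup so the logic is self-contained.

```python
# Sketch of the two self-consistency checks from the abstract.
# `query_model` is a hypothetical placeholder for an LLM call; a real
# evaluation would also need semantic-equivalence matching on answers.

def query_model(prompt: str) -> str:
    # Toy deterministic "model" for illustration only.
    answers = {
        "What is 2+3?": "5",
        "What is 5*2?": "10",
        "What is (2+3)*2?": "10",
    }
    return answers.get(prompt, "unknown")

def hypothetical_consistency(prompt: str) -> bool:
    """Does the model correctly predict what its own output would be
    for the prompt when asked in a hypothetical framing?"""
    actual = query_model(prompt)
    predicted = query_model(f"What would you answer if asked: {prompt}")
    return predicted == actual

def compositional_consistency(sub_prompt: str,
                              composed_prompt: str,
                              substituted_fmt: str) -> bool:
    """Does substituting the model's own sub-step answer into the
    composed prompt leave the final answer unchanged?"""
    sub_answer = query_model(sub_prompt)
    direct = query_model(composed_prompt)
    substituted = query_model(substituted_fmt.format(sub_answer))
    return direct == substituted

# Example: "(2+3)*2" versus "5*2" after substituting the model's
# answer to the sub-step "2+3".
print(compositional_consistency(
    "What is 2+3?", "What is (2+3)*2?", "What is {}*2?"))  # → True
```

The paper's finding is that, unlike this toy lookup table, real GPT-3/-4 variants frequently fail such checks.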
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Edits to formalized definitions of compositional prompts and consistency
- Brief explanations of randomization of in-context prompts
- Further analysis in Section 3.2
- More detailed captions for Figures 2, 3, and 4
- Description of the methods used to compute 95% confidence intervals (nonparametric bootstrapping)
- Minor rewording in conclusion
- Brief explanation of how the estimate of the hypothetical consistency rate may be a lower bound
- Footnote explaining why gpt-4 results are in the Appendix instead of the main text (all other models being compared are text-completion models, whereas gpt-4 is a chat model)
- Added caveat that the experimental definition of semantic equivalence relies upon the original prompt p
- Removed header for Appendix A.1 (since there is only one Appendix section)
Assigned Action Editor: ~Pascal_Poupart2
Submission Number: 1559