GPT-4 Doesn’t Know It’s Wrong: An Analysis of Iterative Prompting for Reasoning Problems

Published: 07 Nov 2023, Last Modified: 04 Dec 2023
Venue: FMDM@NeurIPS 2023
Keywords: Large Language Models, Self-Critique, Reasoning, Graph Coloring Problem, GPT-4
TL;DR: We find that LLM self-critique fails in multiple ways on a canonical reasoning problem, and we analyze the failure modes in detail.
Abstract: There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered by a slew of counterexamples, ranging from multiplication to simple planning, there is still the widespread belief that LLMs can self-critique and iteratively improve their own solutions. This belief seemingly rests on the assumption that verifying correctness should be easier than generating solutions, a classical argument from computational complexity that should be irrelevant to LLMs to the extent that what they are doing is approximate retrieval. In this paper, we set out to systematically investigate the effectiveness of iterative prompting of LLMs in the context of {\em Graph Coloring}, a canonical NP-complete reasoning problem related to propositional satisfiability as well as to practical problems like scheduling and allocation. We present a principled empirical study of the performance of GPT-4 in solving graph coloring instances and in verifying the correctness of candidate colorings, both in direct and iterative modes. In the iterative modes, we experiment both with the model critiquing its own answers and with an external, sound reasoner verifying proposed solutions. In both cases, we analyze whether the content of the criticisms actually affects bottom-line performance. The study indicates that (i) LLMs are bad at solving graph coloring instances, (ii) they are no better at verifying solutions, and are thus ineffective in iterative modes where LLMs critique LLM-generated solutions, and (iii) the correctness and content of the criticisms, whether from LLMs or external solvers, seem largely irrelevant to the performance of iterative prompting. We show that the observed effectiveness of LLMs in iterative settings is largely due to the correct solution being fortuitously present among the top-k completions of the prompt (and being recognized as such by an external, sound verifier). Our results thus call into question claims about the self-critiquing capabilities of state-of-the-art LLMs.
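To make the setup concrete: verifying a candidate coloring is a cheap, sound check, while finding one is the hard part, and that asymmetry is exactly what the sample-and-verify baseline in the abstract exploits. Below is a minimal Python sketch of such an external verifier together with a top-k sample-and-verify loop. This is an illustration under our own assumptions, not the authors' code; the names verify_coloring and sample_and_verify, and the candidate-sampling callback, are hypothetical stand-ins for querying an LLM.

```python
import random

def verify_coloring(edges, coloring, num_colors):
    """Soundly check a candidate coloring of an undirected graph:
    every vertex must be colored from the allowed palette, and no
    edge may join two vertices of the same color."""
    for u, v in edges:
        if u not in coloring or v not in coloring:
            return False, f"vertex {u if u not in coloring else v} is uncolored"
        if coloring[u] == coloring[v]:
            return False, f"edge ({u}, {v}) has both endpoints colored {coloring[u]}"
    if any(not (0 <= c < num_colors) for c in coloring.values()):
        return False, "a vertex uses a color outside the allowed palette"
    return True, "valid coloring"

def sample_and_verify(edges, num_colors, sample_candidate, k=5):
    """Draw k candidate colorings (standing in for the top-k LLM
    completions) and return the first one the sound verifier accepts."""
    for _ in range(k):
        candidate = sample_candidate()
        ok, _ = verify_coloring(edges, candidate, num_colors)
        if ok:
            return candidate
    return None  # no valid coloring among the k samples

if __name__ == "__main__":
    # Triangle K3: 3-colorable but not 2-colorable.
    edges = [(0, 1), (1, 2), (0, 2)]
    print(verify_coloring(edges, {0: 0, 1: 1, 2: 0}, 2))  # (False, ...)
    # Random guessing stands in here for an LLM proposing colorings.
    guess = lambda: {v: random.randrange(3) for v in range(3)}
    print(sample_and_verify(edges, 3, guess, k=50))
```

The paper's point, in these terms, is that the loop only helps when the checker is a sound verifier like the one above; when the LLM itself plays the verifier role, the iteration stops paying off.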
Submission Number: 82