What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study

Aman Madaan; Katherine Hermann; Amir Yazdanbakhsh

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Findings

Submission Type: Regular Long Paper

Submission Track: Language Modeling and Analysis of Language Models

Submission Track 2: Interpretability, Interactivity, and Analysis of Models for NLP

Keywords: LLMs, Analysis, Chain-of-thought, Reasoning, Prompting, Few-shot Reasoning

TL;DR: Chain-of-thought prompting's primary strength lies in its ability to succinctly express the necessary tasks, thereby guiding the models more effectively.

Abstract: The effectiveness of chain-of-thought (CoT) prompting has been widely recognized, but the underlying mechanisms behind its success, the reason why it just works for a wide range of tasks, remain an open question. To investigate this, we employ a counterfactual prompting approach, systematically manipulating elements of the examples used in a few-shot prompt and testing the consequences for model behavior. This allows us to understand the relative contributions of prompt elements such as symbols (digits, entities) and patterns (equations, sentence structure) to in-context learning. Our experiments with three different large language models (LLMs) reveal several key findings. First, the specific symbols used in the prompt do not significantly impact the model's performance. However, consistent patterns in examples and specifying text in a style frequently found on the web are crucial. Second, our findings suggest that the necessity of accurate few-shot examples depends on their role in communicating task understanding. We identify tasks where inaccurate few-shot examples hurt and, surprisingly, tasks where they improve performance. Additionally, we find that the intermediate steps in CoT may not necessarily facilitate learning how to solve a task, but instead efficiently convey task understanding (what) to the model. Furthermore, CoT leverages LLMs to fill in missing commonsense information, particularly helping with difficult reasoning problems and long-tail questions.
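As a rough illustration of the counterfactual prompting idea described in the abstract, the sketch below builds a symbol-level counterfactual of a single few-shot example by replacing each distinct number with an abstract placeholder, leaving the pattern (sentence structure and equation shape) intact. It is not the authors' code: the function names `abstract_numbers` and `build_prompt`, the placeholder letters, and the sample question are all hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List

# Hypothetical placeholder symbols; any set of abstract tokens would do.
PLACEHOLDERS = ["X", "Y", "Z", "U", "V", "W"]


def abstract_numbers(text: str) -> str:
    """Replace every distinct integer in `text` with a placeholder symbol.

    The same number always maps to the same placeholder, so the structure
    of the reasoning (the "pattern") is preserved while the concrete
    symbols are removed.
    """
    mapping: Dict[str, str] = {}

    def substitute(match: "re.Match[str]") -> str:
        number = match.group(0)
        if number not in mapping:
            if len(mapping) >= len(PLACEHOLDERS):
                raise ValueError("not enough placeholder symbols")
            mapping[number] = PLACEHOLDERS[len(mapping)]
        return mapping[number]

    return re.sub(r"\d+", substitute, text)


def build_prompt(examples: List[str], question: str) -> str:
    """Join few-shot examples and a new question into a single prompt string."""
    return "\n\n".join(examples + [f"Q: {question}\nA:"])


# An illustrative few-shot example (not taken from the paper).
original = (
    "Q: Tom has 3 apples and buys 2 more. How many apples does he have?\n"
    "A: Tom starts with 3 apples. 3 + 2 = 5. The answer is 5."
)
counterfactual = abstract_numbers(original)

test_question = "Ann has 4 pens and gets 6 more. How many pens does she have?"
original_prompt = build_prompt([original], test_question)
counterfactual_prompt = build_prompt([counterfactual], test_question)
```

Sending `original_prompt` and `counterfactual_prompt` to the same model and comparing task accuracy would give one way to estimate how much the concrete symbols contribute, under the assumptions stated above.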

Submission Number: 2679
