Chain-of-Thought Reasoning is a Policy Improvement Operator

Hugh Zhang; David Parkes

Published: 28 Oct 2023, Last Modified: 26 Nov 2023

Venue: Instruction Workshop @ NeurIPS 2023

Keywords: language models, chain-of-thought reasoning, self-learning

TL;DR: Language models can use chain-of-thought reasoning to instruct themselves on how to perform addition.

Abstract: Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn to add up to 29-digit numbers without access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similarly to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017). We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.
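
Illustrative sketch: The self-learning loop described in the abstract can be mimicked with a toy simulation. This is not the authors' implementation, and no language model is involved. `ToyModel`, `add_with_cot`, and `sector_loop` are hypothetical names, the decomposition used for "chain-of-thought" (peel off the last digit, solve the shorter problem directly, recombine) is an assumed stand-in, and "training" is simulated by simply raising the number of digits the toy model can handle directly.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a language model.

    It answers addition problems directly only when both operands
    have at most `max_digits` digits.
    """

    def __init__(self, max_digits):
        self.max_digits = max_digits

    def add_directly(self, a, b):
        # Direct answer without reasoning; None signals "beyond current skill".
        if max(len(str(a)), len(str(b))) > self.max_digits:
            return None
        return a + b


def add_with_cot(model, a, b):
    """Assumed chain-of-thought step: reduce to a shorter problem, then recombine."""
    high_a, low_a = divmod(a, 10)
    high_b, low_b = divmod(b, 10)
    high_sum = model.add_directly(high_a, high_b)  # shorter subproblem
    if high_sum is None:
        return None
    return high_sum * 10 + (low_a + low_b)


def sector_loop(model, rounds, samples=20, seed=0):
    """Toy self-learning loop: label harder problems with reasoning, then 'distill'."""
    rng = random.Random(seed)
    for _ in range(rounds):
        n = model.max_digits + 1  # one digit harder than current direct skill
        problems = [
            (rng.randrange(10 ** (n - 1), 10 ** n),
             rng.randrange(10 ** (n - 1), 10 ** n))
            for _ in range(samples)
        ]
        # Labels come from the model's own reasoning, not from ground truth.
        labels = [add_with_cot(model, a, b) for a, b in problems]
        if any(label is None for label in labels):
            break  # reasoning could not solve the harder problems; stop
        # Simulated training: the next model answers n-digit problems directly.
        model = ToyModel(n)
    return model


if __name__ == "__main__":
    final = sector_loop(ToyModel(6), rounds=3)
    print(final.max_digits)
```

In this sketch each round plays the role of one policy improvement step: the reasoning-augmented model is stronger than the direct model, and the next direct model is set to match it. A real system would replace the simulated training step with fine-tuning on the self-generated answers.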

Submission Number: 70
