Training Chain-of-Thought via Latent-Variable Inference

Published: 21 Sept 2023, Last Modified: 16 Jan 2024, NeurIPS 2023 poster
Keywords: Large language models, latent-variable models, control variates, chain-of-thought, MCMC
TL;DR: Treating chain-of-thought-prompted question-answering LLMs as probabilistic latent-variable models lets us derive a principled, simple, effective way of tuning them to generate better rationales without training on human-generated rationales.
Abstract: Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the \emph{marginal} log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.
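To make the objective concrete, here is a minimal sketch in notation not taken from this page: write $x$ for the question, $z$ for a latent rationale, $y$ for the answer, and $\theta$ for the tunable parameters. The marginal log-likelihood being maximized, and the standard Fisher/EM identity that expresses its gradient as a posterior expectation over rationales (one common way to write it, not necessarily the exact estimator used in the paper), are:
\[
\log p_\theta(y \mid x) \;=\; \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z),
\qquad
\nabla_\theta \log p_\theta(y \mid x) \;=\; \mathbb{E}_{z \sim p_\theta(z \mid x,\, y)}\!\big[\nabla_\theta \log p_\theta(y, z \mid x)\big].
\]
The posterior $p_\theta(z \mid x, y)$ is intractable to sample exactly, which is the core challenge noted in the abstract; the proposed MCMC-EM procedure approximates this expectation with samples from a Markov chain over rationales, and the control-variate construction is aimed at reducing the variance of the resulting gradient estimates.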
Submission Number: 15280