TL;DR: We use RL to improve the thought process.
Abstract: LLMs are typically trained to answer user questions or follow instructions in the way human experts respond. However, under the standard alignment framework they lack the ability to think explicitly before answering. Thinking is important for complex questions that require reasoning and planning, but it can benefit *any* task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following, without the use of additional human data. We achieve this with an iterative search-and-optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, thought candidates are scored by a judge model that evaluates only their responses, and the thoughts are then improved via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and yields gains from thinking on non-reasoning categories such as marketing, health, and general knowledge, in addition to more traditional reasoning and problem-solving tasks.
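To make the procedure concrete, here is a minimal Python sketch of one search-and-optimization iteration under the description above; this is not the paper's released code. The functions `generate_thought_and_response`, `judge_score`, and `preference_optimize` are hypothetical placeholders, and pairing the best- and worst-scoring candidates is one plausible (assumed) way to form the preference data.

```python
# Minimal sketch of one thought-training iteration as described in the abstract.
# All model, judge, and optimizer calls are hypothetical placeholders.
import random


def generate_thought_and_response(model, instruction):
    """Placeholder: sample an internal 'thought' followed by a user-facing response."""
    thought = f"<thought about: {instruction}>"
    response = f"<response to: {instruction}>"
    return thought, response


def judge_score(judge, instruction, response):
    """Placeholder: the judge scores the response only; thoughts are never shown to it."""
    return random.random()


def build_preference_pairs(instructions, model, judge, num_candidates=8):
    """For each instruction, sample several thought+response candidates,
    score their responses with the judge, and keep a best/worst pair."""
    pairs = []
    for instruction in instructions:
        candidates = [generate_thought_and_response(model, instruction)
                      for _ in range(num_candidates)]
        scored = [(judge_score(judge, instruction, resp), thought, resp)
                  for thought, resp in candidates]
        scored.sort(key=lambda item: item[0])
        worst, best = scored[0], scored[-1]
        pairs.append({
            "instruction": instruction,
            "chosen": best[1] + best[2],      # highest-scoring thought + response
            "rejected": worst[1] + worst[2],  # lowest-scoring thought + response
        })
    return pairs


def train_iteration(model, judge, instructions, preference_optimize):
    """One search-and-optimize iteration: collect pairs, then update the model."""
    pairs = build_preference_pairs(instructions, model, judge)
    return preference_optimize(model, pairs)  # e.g. a DPO-style update (assumption)
```

In this sketch, repeating `train_iteration` over several rounds corresponds to the iterative procedure: the model's thoughts are shaped only by how well the responses they lead to are judged, with no direct supervision of the thoughts themselves.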
Lay Summary: Imagine a super-smart computer program that answers your questions, like a highly knowledgeable expert. Usually, these programs just give you an answer directly. But what if they could "think" things through, just like we do when faced with a tricky problem? Our research introduces a new way to teach these programs to think step-by-step before answering, even for everyday tasks, not just complex ones. We do this without needing more human help. It's like the program learns to brainstorm and refine its own ideas until it comes up with the best response. This "thinking" ability makes these programs much better at understanding and following instructions, leading to more accurate and helpful answers across a wide range of topics, from marketing advice to health questions and tricky puzzles.
Primary Area: Deep Learning->Large Language Models
Keywords: thought generation, CoT, preference learning, RL
Submission Number: 7432