Planning in Natural Language Improves LLM Search for Code Generation

Evan Z Wang; Federico Cassano; Catherine Wu; Yunfeng Bai; William Song; Vaskar Nath; Ziwen Han; Sean M. Hendryx; Summer Yue; Hugh Zhang

Published: 10 Oct 2024, Last Modified: 28 Oct 2024. Sys2-Reasoning Poster. CC BY 4.0

Keywords: LLM, inference-time, search, inference-time compute, NLP, competitive programming, reasoning, code generation, pass@k, diversity

TL;DR: Searching in natural language space rather than code space induces diversity in generated outputs, which drastically increases the effectiveness of inference-time compute.

Abstract: While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PlanSearch, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PlanSearch generates a diverse set of observations about the problem and uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PlanSearch explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PlanSearch on top of Claude 3.5 Sonnet achieves a pass@200 of 77.0% on LiveCodeBench, outperforming both the best pass-rate achieved without any search (pass@1 = 41.4%) and using standard repeated sampling on top of existing non-search models (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains from search as a function of the diversity over generated ideas. Code can be found at https://github.com/scaleapi/plansearch.
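
The abstract reports results as pass@1 and pass@200. For readers unfamiliar with the metric, the sketch below shows the unbiased pass@k estimator commonly used for code-generation benchmarks: given n sampled solutions of which c pass all tests, it estimates the probability that at least one of k randomly drawn samples is correct. The abstract does not say how the authors compute pass@k, so treating this as their procedure is an assumption; the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a size-k subset drawn without replacement contains no correct sample.
    Assumption: this mirrors the common benchmark convention, not
    necessarily the exact evaluation code used for the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 4 samples and 1 correct, drawing 2 gives 1 - C(3,2)/C(4,2) = 0.5.
print(pass_at_k(n=4, c=1, k=2))
```

Under this definition, pass@k for a fixed model can only grow with k, and it grows faster when the k samples differ from one another, which is the link between diversity and search gains that the abstract describes.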

Submission Number: 47
