Keywords: Inference-time computation, LLM reasoning, Benchmark of LLM reasoning
Abstract: With the advancement of large language models (LLMs), solving complex reasoning tasks (e.g., math problems, code generation) has garnered increasing attention. Inference-time computation methods (e.g., Best-of-N, MCTS) are of significant importance because they can enhance the reasoning capabilities of LLMs without requiring additional training computation. However, due to the inherent challenges of this technique, most existing methods remain proofs of concept and are not yet sufficiently effective. In this paper, we investigate and benchmark strategies for improving inference-time computation across a wide range of reasoning tasks. Since most current methods rely on a pipeline that first generates candidate solutions (e.g., chain-of-thought candidates) and then selects among them according to specific reward signals (e.g., an RLHF reward model or a process reward model), our research focuses on strategies for both candidate generation (e.g., instruction prompts and sampling hyperparameters such as temperature and top-p) and reward mechanisms (e.g., self-evaluation and reward types). The experimental results reveal that several previously overlooked strategies can be critical to the success of inference-time computation (e.g., simply adjusting the sampling temperature can improve performance on general reasoning tasks by up to 5%). Based on extensive experiments (more than 1,000 runs totaling over 20,000 A100-80G GPU hours) across models of various sizes from the Llama, Qwen, and Mistral families, our proposed strategies outperform the baseline by a substantial margin in most cases, providing a stronger foundation for future research.
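To make the generate-then-select pipeline described in the abstract concrete, here is a minimal Best-of-N sketch. It assumes hypothetical `generate` (an LLM sampler taking temperature and top-p) and `reward` (a reward-model scorer) callables, which stand in for whatever models a study like this would actually use; none of these names come from the paper itself.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str
    score: float


def best_of_n(
    prompt: str,
    generate: Callable[[str, float, float], str],  # hypothetical LLM sampler
    reward: Callable[[str, str], float],           # hypothetical reward model
    n: int = 8,
    temperature: float = 0.7,
    top_p: float = 0.9,
) -> Candidate:
    """Sample n candidate solutions, then keep the one with the highest reward."""
    candidates: List[Candidate] = []
    for _ in range(n):
        solution = generate(prompt, temperature, top_p)
        candidates.append(Candidate(solution, reward(prompt, solution)))
    return max(candidates, key=lambda c: c.score)


if __name__ == "__main__":
    # Stand-ins for a real sampler and reward model, for illustration only.
    dummy_generate = lambda p, t, tp: f"candidate-{random.random():.3f}"
    dummy_reward = lambda p, s: random.random()
    best = best_of_n("Solve: 2 + 2 = ?", dummy_generate, dummy_reward, n=4)
    print(best.text, best.score)
```

The knobs studied in the paper (prompting strategy, temperature, top-p, and the choice of reward signal) correspond to the `prompt`, `temperature`, `top_p`, and `reward` arguments of this sketch.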
Primary Area: Machine learning approaches to data and benchmarks enrichment, augmentation and processing (supervised, unsupervised, online, active, fine-tuning, RLHF, SFT, alignment, etc.)
Submission Number: 1651