Abstract: Increasing test-time compute for Large Language Models (LLMs) has demonstrated promising gains across various domains. While this approach has been extensively studied in the math domain, its potential in code generation remains underexplored. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information to robustly identify correct solutions.
Evaluation across 12 Large Language Models and Large Reasoning Models of varying sizes demonstrates the generality and superior performance of S*: (1) it consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) it enables non-reasoning models to surpass reasoning models: GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) it further boosts state-of-the-art reasoning models: DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Anonymous code is available at https://anonymous.4open.science/r/TestTimeCodeGen-1BB1.
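Based only on the abstract's description of the selection mechanism, the following is a minimal sketch of how pairwise, execution-grounded selection over candidate programs might look. The function names (`run_candidate`, `propose_distinguishing_input`, `judge`) and the single-elimination tournament structure are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of pairwise selection with adaptively generated
# distinguishing inputs, grounded in candidate execution. Not the authors' code.
import subprocess
import tempfile
from typing import Callable, List


def run_candidate(code: str, test_input: str, timeout: float = 5.0) -> str:
    """Execute a candidate Python program on a stdin input and return its stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return "<timeout>"


def pairwise_select(
    candidates: List[str],
    propose_distinguishing_input: Callable[[str, str], str],
    judge: Callable[[str, str, str, str, str], int],
) -> str:
    """Pick one program from `candidates` via pairwise comparison.

    `propose_distinguishing_input(code_a, code_b)` is assumed to query an LLM
    for an input on which the two programs may behave differently; `judge` is
    assumed to return 0 or 1 for whichever candidate's output looks correct,
    given both programs, the test input, and their executed outputs.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        test_input = propose_distinguishing_input(best, challenger)
        out_a = run_candidate(best, test_input)
        out_b = run_candidate(challenger, test_input)
        if out_a == out_b:
            continue  # executions agree on this input; keep the current winner
        if judge(best, challenger, test_input, out_a, out_b) == 1:
            best = challenger
    return best
```

In this sketch the execution results (`out_a`, `out_b`) ground the comparison: the judge is only consulted when the two candidates actually diverge on the generated input, which reflects the abstract's claim of combining distinguishing inputs with execution information.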
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Code Generation, Test Time Scaling, Large Language Models, Large Reasoning Models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, Python
Submission Number: 6747