Formalizing Test-Time Compute for Function-Level Code Generation

ACL ARR 2025 July Submission 508 Authors

28 Jul 2025 (modified: 03 Sept 2025), ACL ARR 2025 July Submission, CC BY 4.0
Abstract: Test-time compute has emerged as a powerful paradigm in function-level code generation. However, previously proposed strategies have been treated as disparate methods, precluding a fair apples-to-apples analysis of their operational mechanisms on execution-based benchmarks. We therefore present a mathematical framework that unifies generation and reranking, with theoretical justification through the lens of Minimum Bayes Risk (MBR) decoding. The framework raises key research questions about the effectiveness of parallel and/or iterative sampling, the design of reranking signals and soft/hard MBR utility functions, and the behavior of the final selected program across methods. Our empirical findings highlight the importance of diversity among sampled candidates (over self-improvement), of reranking with simple, high-quality signals, and of the effectiveness of test-time compute for selecting programs that are robust on both general and edge test cases. We will open-source our analysis toolkit and implementation to enable reproducible research.
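The MBR-based selection the abstract describes can be sketched minimally as follows. This is an illustrative assumption, not the paper's exact formulation: candidates are represented by their execution outputs on shared inputs, and the utility is exact-match output agreement (a "hard" MBR utility).

```python
# Minimal sketch of Minimum Bayes Risk (MBR) selection over sampled programs.
# The sample set itself serves as the hypothesis space, a common MBR setup;
# the specific utility below is a hypothetical stand-in for the paper's choices.

def mbr_select(candidates, utility):
    """Return the candidate maximizing expected utility against all samples."""
    best, best_score = None, float("-inf")
    for c in candidates:
        # Monte Carlo estimate of expected utility of candidate c.
        score = sum(utility(c, other) for other in candidates) / len(candidates)
        if score > best_score:
            best, best_score = c, score
    return best

def agreement(a, b):
    """Hard utility: fraction of test inputs on which outputs match exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy example: each tuple holds a program's outputs on three test inputs.
samples = [(1, 2, 3), (1, 2, 4), (1, 2, 3), (0, 0, 0)]
print(mbr_select(samples, agreement))  # -> (1, 2, 3), the consensus candidate
```

A "soft" utility would replace exact match with a graded similarity score; the selection loop is unchanged.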
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Test-Time Compute, Analysis, Function-Level Code Generation
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: Python (Programming Language)
Previous URL: https://openreview.net/forum?id=YDrw6lS4Du
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: We noticed a lack of justified claims from reviewer VpeV (e.g., statements in Weakness 2 referencing details that do not exist in the paper). We also noticed that reviewer avUN provided only a single weakness without further explanation. Finally, we received no further replies from any reviewer during the rebuttal period. We therefore request reviewers who, we hope, will devote more time to reading the paper and discussing it with us.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Ethical consideration section.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: N/A
B1 Elaboration: Section 3.5.
B2 Discuss The License For Artifacts: N/A
B2 Elaboration: The artifacts are created by prompting open-source LLMs with a publicly available dataset, followed by volunteer post-editing, so we do not think further explanation is needed.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: No intended-use discussion is needed, since the dataset contains only simple function-level Python unit tests.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B4 Elaboration: Not applicable, as the artifacts are Python functions containing no personally identifying information.
B5 Documentation Of Artifacts: N/A
B5 Elaboration: Section 3.5.
B6 Statistics For Data: No
B6 Elaboration: No explanation is needed, as the statistics are identical to those of the original datasets (HumanEval and MBPP).
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 3.2
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3 (EvalPlus is used for evaluation)
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: N/A
D1 Elaboration: No instructions were needed, as annotators only edited Python functions for correctness.
D2 Recruitment And Payment: No
D2 Elaboration: The workload is minimal (it can be completed in about four hours), so we asked PhD students to volunteer for the task.
D3 Data Consent: No
D3 Elaboration: The data is under an MIT license.
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: We only used Cursor to help write the code; no further justification is needed.
Author Submission Checklist: yes
Submission Number: 508