Abstract: Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs).
In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which these methods improve their effectiveness.
Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts.
We carefully analyze and ablate the impact of different design strategies when applying test-time scaling to language agents, and report the following findings:
1. Scaling test-time compute can improve the performance of agents.
2. Knowing when to reflect is important for agents.
3. Among different verification and result merging approaches, the list-wise method performs best.
4. Increasing the diversity of rollouts has a positive effect on the agent's task performance. All code is available at https://anonymous.4open.science/r/ATTS-D74F.
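As a rough illustration of strategies (1) and (3) above, the sketch below combines parallel sampling with a list-wise verifier that selects among candidate rollouts. This is a hypothetical outline, not code from the released repository; `run_agent` and `llm_listwise_rank` are assumed placeholders for an agent rollout function and an LLM-based list-wise ranker.

```python
# Minimal sketch of test-time scaling for a language agent:
# parallel sampling of rollouts + list-wise verification.
# `run_agent` and `llm_listwise_rank` are hypothetical placeholders.
import random
from typing import Callable, List


def parallel_sample(task: str,
                    run_agent: Callable[[str, int], str],
                    llm_listwise_rank: Callable[[str, List[str]], int],
                    n_rollouts: int = 4) -> str:
    """Run several independent rollouts, then pick one with a list-wise verifier."""
    # (1) Parallel sampling: launch n_rollouts independent agent trajectories,
    # each with a different random seed to diversify the rollouts.
    candidates = [run_agent(task, random.randint(0, 2**31 - 1))
                  for _ in range(n_rollouts)]

    # (3) List-wise verification: show the verifier all candidates at once and
    # let it return the index of the best answer, instead of scoring each
    # candidate in isolation (point-wise) or in pairs (pair-wise).
    best_index = llm_listwise_rank(task, candidates)
    return candidates[best_index]
```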
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: LLM Agent; GAIA Benchmark; Test-Time Scaling
Languages Studied: en
Previous URL: https://openreview.net/forum?id=KQVeYlPXLV
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Limitations
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 1.2.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Ethics Statement
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We will provide the specific address in the camera-ready version
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The dataset used in our research is constructed using publicly available data sources, ensuring that there are no privacy concerns or violations.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 3.
B6 Statistics For Data: Yes
B6 Elaboration: 3.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 3.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4.
C3 Descriptive Statistics: Yes
C3 Elaboration: 4.
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D1 Elaboration: Our paper does not involve data annotation.
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
E1 Elaboration: We only used AI assistants to polish the writing of the paper.
Author Submission Checklist: Yes
Submission Number: 1174