Abstract: Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs).
In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which these methods improve their effectiveness.
Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts.
We carefully analyze and ablate the impact of different design strategies when applying test-time scaling to language agents, and report the following findings:
1. Scaling test-time compute can improve the performance of agents.
2. Knowing when to reflect is important for agents.
3. Among different verification and result merging approaches, the list-wise method performs best.
4. Increasing the diversity of rollouts has a positive effect on the agent's task performance. All code is available at https://anonymous.4open.science/r/ATTS-D74F.
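As a rough illustration of strategies (1) and (3) above, the sketch below combines parallel sampling with a list-wise verifier that selects among candidate rollouts. This is a hypothetical outline, not code from the released repository; `run_agent` and `llm_listwise_rank` are assumed placeholders for an agent rollout function and an LLM-based list-wise ranker.

```python
# Minimal sketch of test-time scaling for a language agent:
# parallel sampling of rollouts + list-wise verification.
# `run_agent` and `llm_listwise_rank` are hypothetical placeholders.
import random
from typing import Callable, List


def parallel_sample(task: str,
                    run_agent: Callable[[str, int], str],
                    llm_listwise_rank: Callable[[str, List[str]], int],
                    n_rollouts: int = 4) -> str:
    """Run several independent rollouts, then pick one with a list-wise verifier."""
    # (1) Parallel sampling: launch n_rollouts independent agent trajectories,
    # each with a different random seed to diversify the rollouts.
    candidates = [run_agent(task, random.randint(0, 2**31 - 1))
                  for _ in range(n_rollouts)]

    # (3) List-wise verification: show the verifier all candidates at once and
    # let it return the index of the best answer, instead of scoring each
    # candidate in isolation (point-wise) or in pairs (pair-wise).
    best_index = llm_listwise_rank(task, candidates)
    return candidates[best_index]
```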
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: LLM Agent; GAIA Benchmark; Test-Time Scaling
Languages Studied: en
Previous URL: https://openreview.net/forum?id=KQVeYlPXLV
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Limitations
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 1.2.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Ethics Statement
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We will provide the specific address in the camera-ready version
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The dataset used in our research is constructed using publicly available data sources, ensuring that there are no privacy concerns or violations.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 3.
B6 Statistics For Data: Yes
B6 Elaboration: 3.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 3.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4.
C3 Descriptive Statistics: Yes
C3 Elaboration: 4.
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D1 Elaboration: Our paper does not involve data annotation.
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
E1 Elaboration: We only used AI assistants to polish the writing of the paper.
Author Submission Checklist: Yes
Submission Number: 1174