RefactorBench: Evaluating Stateful Reasoning In Language Agents Through Code

Published: 22 Oct 2024 · Last Modified: 29 Oct 2024 · NeurIPS 2024 Workshop Open-World Agents · Poster · CC BY 4.0
Keywords: Language Agents, Benchmarks, Code Generation, Reasoning, State-Awareness, Refactoring, Long-Horizon Tasks, Knowledge Representation
TL;DR: A new benchmark of multi-file code refactoring tasks for testing language agent reasoning, plus an exploration of state-awareness to improve performance.
Abstract: Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. Solving tasks within RefactorBench requires thorough exploration of dependencies across multiple files and strong adherence to relevant instructions. Each task is defined by 3 natural language instructions of varying specificity, and tasks are mutually exclusive with one another, allowing longer pseudo-tasks to be chained on the same repository. Baselines on RefactorBench reveal that current LM agents struggle with simple compositional tasks, solving only 22% of tasks with base instructions, compared to 87% for a human developer under short time constraints. Through trajectory analysis, we identify several failure modes unique to LM agents and further explore the failure mode of losing track of past actions. By adapting a baseline agent to condition on representations of state, we achieve a 43.9% improvement in solving RefactorBench tasks. We further extend our state-aware approach to encompass entire digital environments and outline potential directions for future research. RefactorBench aims to support the study of LM agents by providing a set of real-world, multi-hop tasks within the realm of code.
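
As a rough illustration of what "conditioning a baseline agent on representations of state" could look like in practice, the Python sketch below keeps an explicit log of the agent's past repository edits and includes a summary of it in every prompt, directly targeting the failure mode of losing track of past actions. All names here (RepoState, call_llm, run_state_aware_agent) are hypothetical and not taken from the paper or its codebase; this is a minimal sketch of the general idea, not the authors' implementation.

from dataclasses import dataclass, field


@dataclass
class RepoState:
    """Explicit representation of state: the edits the agent has already made."""
    edits: list[str] = field(default_factory=list)  # e.g. "renamed helper() in utils.py"

    def summary(self) -> str:
        if not self.edits:
            return "No edits made yet."
        return "Edits so far:\n" + "\n".join(f"- {e}" for e in self.edits)


def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; replace with an LM client of your choice."""
    raise NotImplementedError


def run_state_aware_agent(task: str, max_steps: int = 20) -> RepoState:
    """Agent loop that conditions every step on an explicit summary of past actions."""
    state = RepoState()
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\n"
            f"{state.summary()}\n"
            "Propose the next single edit, or reply DONE if the refactor is complete."
        )
        action = call_llm(prompt)
        if action.strip() == "DONE":
            break
        # In a real agent the proposed edit would be applied to the repository here;
        # this sketch only records it so that later steps can see what was done.
        state.edits.append(action.strip())
    return state

The design point is simply that the state summary is rebuilt and re-injected at every step, so earlier actions remain visible to the model rather than scrolling out of an implicit transcript.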
Submission Number: 109