Data Interpreter: An LLM Agent for Data Science

ACL ARR 2025 February Submission8155 Authors

16 Feb 2025 (modified: 09 May 2025) | ACL ARR 2025 February Submission | Readers: Everyone | License: CC BY 4.0
Abstract: Large Language Model (LLM)-based agents have excelled in various domains but face significant challenges when applied to data science workflows because of their complex, multi-stage nature. Current LLM-based agents struggle with non-linear relationships, recursive dependencies, implicit data- and logic-dependent reasoning, and the management of extensive context. In this paper, we introduce Data Interpreter, an LLM-based agent that addresses these challenges with hierarchical graph-based modeling to represent task complexity and a progressive strategy for step-by-step verification, refinement, and consistent context management. Extensive experiments confirm the effectiveness of Data Interpreter. On InfiAgent-DABench, it improves performance by 25% in relative terms (from 75.9% to 94.9%); on machine learning and open-ended tasks, it lifts accuracy from 88% to 95% and from 60% to 97%, respectively. Moreover, our method surpasses state-of-the-art baselines by 26% on the MATH dataset. We will release the code upon publication.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Data Science, LLMs, Reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 8155