Large Language Model-based Data Science Agent: A Survey

Published: 08 Feb 2026, Last Modified: 08 Feb 2026 · Accepted by TMLR · CC BY 4.0
Abstract: The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comprehensive analysis of LLM-based agents designed for data science tasks, summarizing insights from recent studies. From the agent perspective, we discuss the key design principles, covering agent roles, execution, knowledge, and reflection methods. From the data science perspective, we identify the key processes for LLM-based agents, including data preprocessing, model development, evaluation, and visualization. Our work offers two key contributions: (1) a comprehensive review of recent developments in applying LLM-based agents to data science tasks; (2) a dual-perspective framework that connects general agent design principles with practical data science workflows.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We have carefully incorporated all revisions described in our response to the reviewers into the manuscript, and have additionally adjusted the surrounding context where necessary to ensure that the newly added or revised content is coherently integrated into the paper as a whole. Below, we briefly summarize the key changes made in this revision:

**1. Substantial Revision of the Introduction with Explicit Research Questions (Page 2)**

We revised the Introduction and introduced five progressively structured research questions (RQs) that serve as the conceptual backbone of the paper. These RQs directly correspond to the five major sections of the survey.

**2. Addition of Quantitative Analyses to Characterize the Research Landscape and Trends**

To address the lack of quantitative analysis in the original version, we added two types of quantitative visualizations:

* 2.1 Keyword Frequency Cloud (Figure 2, Page 4): We introduce a keyword frequency cloud to illustrate the relative prevalence of core methodological components. Keywords are color-coded according to their corresponding subsections.
* 2.2 Temporal Evolution of SOTA Performance on Benchmarks (Figure 15, Page 22): We add temporal analyses of state-of-the-art performance on three representative benchmarks, illustrating how benchmark performance evolves over time and how improvements relate to different agent systems and backbone models.

**3. Complete Rewrite of the Opening Paragraph of Section 3 (Page 4)**

We fully rewrote the opening paragraph of Section 3 to explicitly ground the discussion in the inherent limitations of current LLMs, including hallucination, brittle code generation, and difficulty with long-horizon workflows. This revised introduction motivates why agent-based system design is necessary and reframes Sections 3.1–3.4 as four systematic solution dimensions that directly target these failure modes, rather than as a simple taxonomy of design choices.

**4. Addition of Strength–Limitation–Trade-off Analyses, with Cross-Method Comparison (Pages 4–10)**

For each method discussed in Section 3.1, we added a dedicated concluding paragraph that analyzes:

* strengths and benefits,
* limitations and failure modes,
* trade-offs among reliability, scalability, and coordination cost, and
* suitability for real-world and industrial deployment scenarios.

In addition, we introduced a cross-method trade-off summary table (Table 3, Page 10) that enables direct horizontal comparison across methods.

**5. Rewriting Section 5 (Benchmarks) from a Listing to a Critical Analysis (Page 21)**

We substantially revised Section 5 to move beyond a catalog-style presentation of benchmarks. The revised section now provides a critical analysis of existing benchmark suites by:

* summarizing their respective strengths and intended evaluation focus,
* identifying three common structural limitations shared across many benchmarks (e.g., evaluation of isolated steps, rigid task formats, and overly clean inputs), and
* emphasizing the necessity of a multi-benchmark, complementary evaluation perspective to more accurately assess real-world agent capabilities.

**6. Substantial Revision of Section 6 (Future Research Opportunities) to Align with Identified Limitations (Pages 21–22)**

We rewrote Section 6 to ensure that future research directions are explicitly motivated by the limitations identified in Sections 3 and 5.
Specifically:

* 6.1 We removed two previous subsections: Multimodal Processing (not directly motivated by the limitations identified earlier) and Trainable Architecture (which has since been explored by recent works).
* 6.2 We refined Advanced Reflection Mechanisms into a more precise and operational direction, Pipeline-Level Reflection, focusing on long-horizon, cross-stage error attribution and correction.
* 6.3 We introduced two new research directions directly derived from the identified limitations: Data-Centric Diagnostics, addressing failures caused by data irregularities, schema mismatches, and distributional issues that are invisible to current reflection mechanisms; and Uncertainty-Aware Workflow Planning, emphasizing the need for agents to model uncertainty and adapt execution strategies to prevent error propagation.
Assigned Action Editor: ~Stefan_Lee1
Submission Number: 5567