Large Language Models for Data Science: A Survey

ACL ARR 2025 February Submission6225 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Data science is an interdisciplinary field that focuses on extracting knowledge from raw data using statistical analysis and machine learning techniques. However, as data continues to grow in scale and complexity, data scientists face increasing challenges in handling unstructured data, automating workflows, and scaling analytical processes. Advances in large language models (LLMs) present an unprecedented opportunity to enhance and streamline data science tasks by automating and augmenting key processes in the data science pipeline. This survey covers four core aspects: the role of LLMs in the data science cycle, specialized domain applications, challenges and limitations, and social impact and future directions. Specifically, we introduce a structured framework defining how LLMs contribute to each stage of data science, discuss in depth their applications in key domains such as healthcare and finance, analyze key obstacles such as data quality and model interpretability, and explore ethical concerns and future research opportunities. Serving as a comprehensive resource, this survey aims to assist researchers and practitioners in understanding and using LLMs to advance modern data science methodologies.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, Data Science
Contribution Types: Surveys
Languages Studied: English
Submission Number: 6225