SmartDS-Solver: Agentic AI for Vertical Domain Problem Solving in Data Science

05 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | License: CC BY 4.0
Keywords: Agentic AI, Auto ML, Hyperparameter Tuning, NLP, Code Generation from Natural Language, Planning with Language Models
Abstract: Automating complex, multi-step vertical domain tasks, such as Data Science (DS) workflows, presents significant challenges for large language model (LLM) agents. Existing AutoDS approaches often rely on prompt-sensitive, fragmented multi-turn interactions and costly full re-generation upon execution failure, leading to unstable workflow coherence and high token consumption. We introduce SmartDS-Solver, a reasoning-centric agentic system designed to enhance the stability, robustness, and cost efficiency of these workflows. Our core approach integrates rigorous workflow planning into a domain-specialized Reasoning LLM, which is trained using structured methodological distillation and a two-stage Group Relative Policy Optimization (GRPO) procedure. Crucially, SmartDS-Solver employs a lightweight agentic layer featuring the novel State-Aware Refinement and Temperature Exploration (SARTE) algorithm. SARTE dynamically adjusts the LLM’s decoding strategy based on deterministic execution feedback, enabling minimally invasive patching rather than costly full re-planning. We evaluate SmartDS-Solver on 32 datasets covering 11 MLE-Bench tasks, 18 AutoML-Agent benchmarks, and 3 real-world tasks, and observe consistent gains alongside reduced inference and modification token usage. On MLE-Bench, our 32B model attains an 81.8% win rate over the AIDE+o1-preview baseline, and on the 18 AutoML-Agent tasks the win rate reaches 94%. Notably, even a 7B model produces fully executable solutions on all evaluated tasks, demonstrating the scalability and robustness of our method. SmartDS-Solver reduces token usage by approximately 78% on the 11 MLE-Bench tasks. The SARTE meta-control mechanism significantly boosts decoding performance, raising average accuracy by 3.9%, lowering error rates by 12%, and delivering an overall 75% significant improvement on MLE-Bench tasks (p = 0.0173).
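The abstract describes SARTE only at a high level, so the following is a minimal illustrative sketch of the general idea (patching a failing solution while adjusting decoding temperature from execution feedback), not the authors' algorithm. Every name here (`ExecResult`, `refine_with_temperature`, `generate_patch`, `execute`) and every default value is a hypothetical chosen for illustration, and the "raise temperature when the same error repeats" rule is an assumption rather than something stated in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ExecResult:
    """Outcome of running a candidate solution (hypothetical type)."""
    ok: bool
    error: str = ""


def refine_with_temperature(
    generate_patch: Callable[[str, str, float], str],
    execute: Callable[[str], ExecResult],
    code: str,
    base_temp: float = 0.2,
    temp_step: float = 0.2,
    max_temp: float = 1.0,
    max_rounds: int = 5,
) -> Tuple[str, bool, int]:
    """Patch `code` until it executes, exploring more when patches stall.

    Returns (final_code, succeeded, patch_rounds_used).
    """
    temp = base_temp
    last_error: Optional[str] = None
    for round_idx in range(max_rounds):
        result = execute(code)
        if result.ok:
            return code, True, round_idx
        if result.error == last_error:
            # Assumed rule: the previous patch left the error unchanged,
            # so sample more diversely on the next attempt.
            temp = min(max_temp, temp + temp_step)
        else:
            # A new error means the state changed; return to conservative decoding.
            temp = base_temp
        last_error = result.error
        # Request a patch to the existing code instead of a full re-generation.
        code = generate_patch(code, result.error, temp)
    final = execute(code)
    return code, final.ok, max_rounds
```

In this sketch `generate_patch(code, error, temperature)` stands in for a call to the reasoning model and `execute(code)` for a sandboxed run. Passing both as callables keeps the control loop independent of any particular model API.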
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 2372