TL;DR: We present FT-Dojo and FT-Agent to enable and evaluate LLM agents that autonomously perform end-to-end fine-tuning by jointly optimizing data curation and training through closed-loop iteration.
Abstract: Fine-tuning large language models for vertical domains remains labor-intensive, requiring practitioners to curate data, configure training, and iteratively diagnose model behavior.
Despite growing interest in autonomous machine learning and language agents, end-to-end LLM fine-tuning has not been systematically studied as an interactive agent task.
We introduce FT-Dojo, an interactive benchmark environment for autonomous LLM fine-tuning, comprising 13 tasks across 5 domains.
Rather than a new collection of static datasets, FT-Dojo standardizes a task interface, shared raw-data repository, sandboxed execution environment, structured feedback protocol, and held-out evaluation procedure.
We further develop FT-Agent, a fine-tuning-oriented autonomous framework that uses structured iteration planning, fail-fast validation, and multi-level feedback analysis to refine data and training strategies.
Experiments show that FT-Agent provides a strong initial baseline, achieving the best performance on 10 of 13 tasks. Additional controlled comparisons with frontier agents and open-source planning backbones, together with multi-run statistics, support the main findings.
Case studies show that agents can recover from failures through cumulative learning, but they also expose limitations in causal diagnosis and long-horizon planning.
The implementation is available at https://github.com/microsoft/rd-agent.
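As an illustration of the closed-loop iteration described above, the following is a minimal sketch and not FT-Agent's actual implementation. All names (`Plan`, `Feedback`, `fail_fast_check`, `run_loop`) and the two plan fields are hypothetical, chosen only to show how planning, fail-fast validation, and feedback accumulation might fit together; the real framework may structure these steps quite differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Plan:
    """Hypothetical plan for one iteration: a data recipe plus a training setting."""
    data_filter_threshold: float
    learning_rate: float


@dataclass
class Feedback:
    """Hypothetical structured feedback recorded after each attempt."""
    ok: bool
    score: Optional[float] = None
    error: Optional[str] = None


History = List[Tuple[Plan, Feedback]]


def fail_fast_check(plan: Plan) -> Optional[str]:
    """Cheap validation that runs before any expensive training job."""
    if not 0.0 <= plan.data_filter_threshold <= 1.0:
        return "data_filter_threshold must be in [0, 1]"
    if plan.learning_rate <= 0:
        return "learning_rate must be positive"
    return None


def run_loop(
    propose: Callable[[History], Plan],
    train_and_eval: Callable[[Plan], float],
    max_iters: int = 5,
) -> Tuple[Optional[Tuple[Plan, Feedback]], History]:
    """Iterate: plan, validate early, train and evaluate, record feedback."""
    history: History = []
    best: Optional[Tuple[Plan, Feedback]] = None
    for _ in range(max_iters):
        # The planner sees all earlier plans and their feedback.
        plan = propose(history)
        error = fail_fast_check(plan)
        if error is not None:
            # Invalid plans are rejected without spending training compute.
            feedback = Feedback(ok=False, error=error)
        else:
            feedback = Feedback(ok=True, score=train_and_eval(plan))
            if best is None or feedback.score > best[1].score:
                best = (plan, feedback)
        history.append((plan, feedback))
    return best, history
```

In this sketch, `propose` stands in for the LLM planner and `train_and_eval` for a sandboxed fine-tuning run followed by scoring; both are supplied by the caller. Failed attempts stay in `history`, so a later call to `propose` can take them into account.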
Lay Summary: Adapting large language models to specialized tasks often takes a lot of expert effort. People must choose and clean training data, set up training runs, inspect failures, and repeat this process many times. Our work studies whether AI agents can help carry out this process more independently.
We introduce FT-Dojo, a testing environment that lets researchers compare AI agents on realistic model-adaptation tasks across several domains. FT-Dojo gives each agent the same task setup, data access, computing environment, feedback, and final evaluation, so their behavior can be compared fairly. We also build FT-Agent, an agent designed for this setting. It plans training iterations, checks for problems early, and uses feedback from failed or successful runs to improve its next attempt.
Our experiments show that FT-Agent is a strong starting point for this problem, performing best on most tasks. The results also show where current agents still struggle, especially when they need to find the true cause of failures or plan over many steps.
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/microsoft/rd-agent
Primary Area: General Machine Learning->Evaluation
Keywords: AI Evaluation, Benchmark and Evaluation, AutoML, LLM Agent, Autonomous Agents, LLM Finetuning
Originally Submitted PDF: pdf
Submission Number: 743