# ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering
This repository is the official implementation of ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering.

## Machine Learning Tasks
In our paper, we organize 9 training tasks (held-in) for both exploration-enriched fine-tuning and step-wise reinforcement learning (RL), and 10 tasks (held-out) for further evaluation. All tasks are collected from [MLAgentBench](https://arxiv.org/abs/2310.03302) (MLA) and [MLEBench](https://arxiv.org/abs/2410.07095) (MLE).

The tasks are listed as follows:
| Task Name                                           | Data Type | Task Type    | Metric   | Source |
|-----------------------------------------------------|-----------|--------------|----------|--------|
| **Training**                                         |           |              |          |        |
| cifar-10                                            | Image     | Classification | Acc. (%) ↑ | MLA    |
| aerial-cactus-identification                        | Image     | Classification | AUC ↑    | MLE    |
| dogs-vs-cats-redux-kernels-edition                  | Image     | Classification | Logloss ↓ | MLE    |
| plant-pathology-2020-fgvc7                          | Image     | Classification | AUC ↑    | MLE    |
| home-data-for-ml-course                             | Tabular   | Regression    | MAE ↓    | MLA    |
| spaceship-titanic                                   | Tabular   | Classification | Acc. (%) ↑ | MLA    |
| nomad2018-predict-transparent-conductors            | Tabular   | Regression    | RMSLE ↓  | MLE    |
| feedback-prize-english-language-learning            | Text      | Classification | MCRMSE ↓ | MLA    |
| ogbn-arxiv                                          | Graph     | Classification | Acc. (%) ↑ | MLA    |
| **Testing**                                          |           |              |          |        |
| denoising-dirty-documents                           | Image     | Generation    | RMSE ↓   | MLE    |
| leaf-classification                                 | Image     | Classification | Logloss ↓ | MLE    |
| statoil-iceberg-classifier-challenge                | Image     | Classification | Logloss ↓ | MLE    |
| whale-categorization-playground                     | Image     | Classification | MAP@5 ↑  | MLE    |
| learning-agency-lab-automated-essay-scoring-2        | Text      | Regression    | QWK ↑    | MLE    |
| detecting-insults-in-social-commentary              | Text      | Classification | Acc. (%) ↑ | MLE    |
| spooky-author-identification                         | Text      | Classification | Logloss ↓ | MLE    |
| jigsaw-toxic-comment-classification-challenge       | Text      | Classification | AUC ↑    | MLE    |
| us-patent-phrase-to-phrase-matching                 | Tabular   | Regression    | PCC ↑    | MLE    |
| tabular-playground-series-dec-2021                  | Tabular   | Classification | Acc. (%) ↑ | MLE    |
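
The arrows in the Metric column say whether a higher (↑) or lower (↓) score is better. When comparing two runs on the same task, that direction has to be taken into account. The snippet below is a minimal sketch of such a comparison and is not part of this repository: the names `HIGHER_IS_BETTER` and `is_improvement` are hypothetical, and directions for metrics without an arrow in the table follow their usual definitions (error and loss metrics are minimized, correlation and agreement metrics are maximized).

```python
# Hypothetical helper, not part of this repository.
# True means a higher score is better, False means a lower score is better.
HIGHER_IS_BETTER = {
    "Acc.": True,
    "AUC": True,
    "MAP@5": True,
    "QWK": True,
    "PCC": True,
    "Logloss": False,
    "MAE": False,
    "RMSE": False,
    "RMSLE": False,
    "MCRMSE": False,
}


def is_improvement(metric: str, new_score: float, old_score: float) -> bool:
    """Return True if new_score is strictly better than old_score for the metric."""
    if HIGHER_IS_BETTER[metric]:
        return new_score > old_score
    return new_score < old_score
```

For example, `is_improvement("Logloss", 0.3, 0.5)` is true because a lower log loss is better, while `is_improvement("Acc.", 80.0, 85.0)` is false.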


## Requirements
- **Training.** For exploration-enriched fine-tuning, we use the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) codebase; for step-wise RL, we use the [VeRL](https://github.com/volcengine/verl) codebase. Please follow the instructions in the respective repositories to set up the training environment.

- **Evaluation.** To set up the environment for evaluating the agents on the machine learning tasks, install the evaluation dependencies:
    ```
    pip install -r eval_requirements.txt
    ```
    The machine learning tasks themselves are executed in a separate environment. You can use the following commands to create a new conda environment named `mlagentbench`:
    ```
    conda create -n mlagentbench python=3.10
    conda activate mlagentbench
    pip install -r mlagent_requirements.txt
    ```



## Training
### Exploration-enriched Fine-tuning
The base model for fine-tuning is Qwen2.5-7B, which you can download from [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B).
The example data samples are shown in ```data/sft/multi_turn_sft-l2-sample2-history_thought.json```.
```
bash scripts/train_step1.sh
```
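
Before launching training, it can help to look at the example data file mentioned above. The following is a minimal sketch, not part of this repository: `summarize_samples` is a hypothetical helper, and it only assumes that the file is valid JSON, without assuming any particular field names.

```python
import json
from pathlib import Path


def summarize_samples(path):
    """Load a JSON file and return (container type name, number of entries, first entry).

    Hypothetical helper: it makes no assumption about the schema of the samples.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, list):
        first = data[0] if data else None
    elif isinstance(data, dict):
        # For a dict, the "first entry" is the first (key, value) pair.
        first = next(iter(data.items()), None)
    else:
        # A bare scalar counts as a single entry.
        return type(data).__name__, 1, data
    return type(data).__name__, len(data), first


# Example usage (run from the repository root):
# kind, count, first = summarize_samples(
#     "data/sft/multi_turn_sft-l2-sample2-history_thought.json")
# print(kind, count)
```

The same helper can be pointed at `data/ppo/ppo-l90-sample90.json` to inspect the RL example data.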
### Step-wise Reinforcement Learning
RL is trained on the fine-tuned model from step 1.
The example data samples are shown in ```data/ppo/ppo-l90-sample90.json```.
```
bash scripts/train_step2.sh
```
Note that the saved checkpoint should be converted to the right format for evaluation by running ```verl/scripts/model_merger.py```.
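
After conversion, a quick sanity check on the output directory can catch an incomplete merge before evaluation starts. The sketch below is not part of this repository and rests on an assumption: that the converted checkpoint is a Hugging Face-style directory containing `config.json` plus `*.safetensors` or `pytorch_model*.bin` weight files. The helper name `looks_like_hf_checkpoint` is hypothetical.

```python
from pathlib import Path


def looks_like_hf_checkpoint(ckpt_dir):
    """Return True if ckpt_dir has a config.json and at least one weight file.

    Assumption: the converted checkpoint follows the usual Hugging Face layout.
    This checks file presence only; it does not load or validate the weights.
    """
    d = Path(ckpt_dir)
    if not (d / "config.json").is_file():
        return False
    has_safetensors = any(d.glob("*.safetensors"))
    has_bin = any(d.glob("pytorch_model*.bin"))
    return has_safetensors or has_bin
```

If the check fails, rerun the conversion before pointing the evaluation script at the checkpoint.
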
## Evaluation
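We evaluate on 3 held-in tasks (cifar-10, house-price, and feedback, listed in the table above as cifar-10, home-data-for-ml-course, and feedback-prize-english-language-learning) and the 10 held-out tasks.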
```
bash scripts/eval.sh
```
You can specify the evaluation task and the model checkpoint by setting the corresponding parameters in the script ```scripts/eval.sh```.
