# Toward Self-Evolving Systems of LLM Agents through Exploration and Iterative Feedback

This repository contains the codebase for the paper:

**"Toward Self-Evolving Systems of LLM Agents through Exploration and Iterative Feedback"**

## Overview

This project presents a framework for discovering and improving the skills of language agents in web-based environments. The system iteratively generates tasks, collects agent trajectories, and uses feedback to guide further exploration and skill acquisition. The approach is model-agnostic and supports both OpenAI API-based and HuggingFace-based language models.
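The iterative loop described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: `run_rounds` and its three callables (`generate_tasks`, `explore`, `analyze`) are hypothetical names standing in for the task generation, exploration, and feedback stages.

```python
from typing import Callable, Dict, List

Task = str
Trajectory = Dict[str, object]


def run_rounds(
    generate_tasks: Callable[[List[str]], List[Task]],
    explore: Callable[[Task], Trajectory],
    analyze: Callable[[List[Trajectory]], List[str]],
    num_rounds: int = 3,
) -> List[Trajectory]:
    """Alternate task generation, exploration, and feedback for a fixed number of rounds.

    All three callables are hypothetical placeholders for the pipeline stages.
    """
    feedback: List[str] = []  # no feedback is available before the first round
    all_trajectories: List[Trajectory] = []
    for _ in range(num_rounds):
        tasks = generate_tasks(feedback)                  # new tasks, conditioned on feedback
        trajectories = [explore(task) for task in tasks]  # one trajectory per task
        feedback = analyze(trajectories)                  # skill gaps guide the next round
        all_trajectories.extend(trajectories)
    return all_trajectories
```

The feedback from one round is the only state carried into the next, which keeps each stage replaceable by a different model or backend.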

## Main Components

- **Task Generation**: Automatically generates diverse web shopping tasks for agents to solve, using either OpenAI models or HuggingFace models.
- **Agent Exploration**: Agents interact with a simulated web environment, taking actions based on instructions and persona information.
- **Feedback and Skill Analysis**: Collects feedback on agent performance, identifies skill gaps, and generates targeted feedback for further exploration.
- **Evaluation**: Evaluates agent performance on held-out tasks, supporting both custom and standard evaluation protocols.
- **Data Conversion**: Converts collected trajectories and feedback into JSON/JSONL formats for training and analysis.
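
As an illustration of the data conversion step, the sketch below turns a JSON file into JSONL. It assumes the input is a top-level JSON array of records; the function name `json_to_jsonl` is hypothetical, and the actual schema of the collected trajectories is not specified here.

```python
import json
from pathlib import Path
from typing import Union


def json_to_jsonl(src: Union[str, Path], dst: Union[str, Path]) -> int:
    """Write each element of a top-level JSON array as one JSONL line.

    Returns the number of records written. The array layout is an assumption.
    """
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    with Path(dst).open("w", encoding="utf-8") as out:
        for record in records:
            # json.dumps emits no raw newlines, so each record stays on one line
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```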

## Repository Structure

- `alice_create_task_openai.py`: Task generation and exploration pipeline using OpenAI API models.
- `alice_create_tasks.py`: Task generation and exploration pipeline using HuggingFace models.
- `alice_feedback.py`: Feedback collection and skill analysis for agent trajectories.
- `eval_bob.py`: Evaluation of agent performance on custom or standard tasks.

## Setup and Dependencies

### Requirements

First, set up the WebShop environment by following the instructions in its original repository:

```bash
git clone <WEBSHOP_GIT_URL>
cd webshop
pip install -e .
```

Then install the remaining dependencies with:

```bash
pip install -r requirements.txt
```

### Additional Files

- `persona.json`: List of persona strings (or use the default from `proj-persona/PersonaHub`).
- `triplets.csv`: CSV file with few-shot action sequences for agent prompting.
- Model checkpoints: For HuggingFace-based pipelines, provide the path to a compatible language model checkpoint.
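
Before launching a run, it can help to check that `persona.json` has the expected shape. The sketch below assumes the file is a JSON list of strings, as described above; `load_personas`, `sample_size`, and `seed` are hypothetical names and are not part of the repository's scripts.

```python
import json
import random
from pathlib import Path
from typing import List, Optional, Union


def load_personas(
    path: Union[str, Path],
    sample_size: Optional[int] = None,
    seed: int = 0,
) -> List[str]:
    """Load a JSON list of persona strings, optionally drawing a reproducible sample."""
    personas = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(personas, list) or not all(isinstance(p, str) for p in personas):
        raise ValueError("persona file must be a JSON list of strings")
    if sample_size is None:
        return personas
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return rng.sample(personas, min(sample_size, len(personas)))
```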

## Usage

### 1. Task Generation and Exploration

#### OpenAI-based Pipeline

Generate tasks and agent trajectories using OpenAI models:

```bash
python alice_create_task_openai.py --api-key <OPENAI_API_KEY> --model gpt-4o-mini --num-tasks 100 --output-dir output_data --persona-path persona.json
```

Key arguments:

- `--api-key`: Your OpenAI API key.
- `--model`: OpenAI model name (e.g., `gpt-4o-mini`).
- `--num-tasks`: Number of tasks to generate.
- `--output-dir`: Directory to save results.
- `--persona-path`: Path to persona data (default: `persona.json`).

#### HuggingFace-based Pipeline

Generate tasks and agent trajectories using a local HuggingFace model:

```bash
python alice_create_tasks.py --model-path <MODEL_PATH> --output-path output/raw_instruction.json --persona-path persona.json --num-tasks 100
```

Key arguments:

- `--model-path`: Path to the HuggingFace model checkpoint.
- `--output-path`: Path to save output JSON.
- `--persona-path`: Path to persona data.
- `--num-tasks`: Number of tasks to generate.

### 2. Feedback and Skill Analysis

Collect feedback and analyze skill gaps:

```bash
python alice_feedback.py --model-path <MODEL_PATH> --output-path output/feedback.json --persona-path persona.json --eval-file <EVAL_FILE>
```

Key arguments:

- `--model-path`: Path to the model checkpoint.
- `--output-path`: Path to save feedback results.
- `--persona-path`: Path to persona data.
- `--eval-file`: Path to the evaluation file produced by previous runs.

### 3. Evaluation

Evaluate agent performance on custom or standard tasks:

```bash
python eval_bob.py --model-path <MODEL_PATH> --output-dir results --num-tasks 100
```

Key arguments:

- `--model-path`: Path to the model checkpoint.
- `--output-dir`: Directory to save evaluation results.
- `--num-tasks`: Number of evaluation tasks.
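
The three HuggingFace-based stages can also be chained from a small driver. The sketch below only assembles the command lines documented above. The helper names `build_commands` and `run_all`, the fixed output paths, and the stage order (taken from this README) are assumptions, and the `eval_file` argument must point to a file you already have from an earlier run.

```python
import subprocess
import sys
from typing import List


def build_commands(
    model_path: str, persona_path: str, eval_file: str, num_tasks: int = 100
) -> List[List[str]]:
    """Return one argument list per stage, using only the flags documented in this README."""
    py = sys.executable  # run the scripts with the current interpreter
    n = str(num_tasks)
    return [
        [py, "alice_create_tasks.py", "--model-path", model_path,
         "--output-path", "output/raw_instruction.json",
         "--persona-path", persona_path, "--num-tasks", n],
        [py, "alice_feedback.py", "--model-path", model_path,
         "--output-path", "output/feedback.json",
         "--persona-path", persona_path, "--eval-file", eval_file],
        [py, "eval_bob.py", "--model-path", model_path,
         "--output-dir", "results", "--num-tasks", n],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing stage
```

Calling `run_all(build_commands(...))` with the default `dry_run=True` only prints the commands, which is a cheap way to review them before starting a long run.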
