# Magentic One with GAIA (Mutated AutoGen Benchmark)

This repository contains a slightly modified version of the [Microsoft AutoGen Benchmark tool](https://github.com/microsoft/autogen/tree/main/python/packages/agbench/), adapted specifically for the GAIA benchmark.

This mutated version remains compatible with the original AutoGen Benchmark and adds the token-usage and cost recording that our benchmarking requires.

## Setup

### Docker Requirement

AutoGenBench requires Docker (Desktop or Engine). **It will not run in GitHub Codespaces**, unless you opt for native execution (which is strongly discouraged). To install Docker Desktop, see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/).
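
Before continuing, it can help to confirm that Docker is actually usable from your shell. The following is a minimal sketch, not part of AutoGenBench: `check_docker` is a hypothetical helper name, and it relies only on `docker info`, which fails when the client cannot reach the Docker daemon.

```bash
# Hypothetical pre-flight helper; not part of AutoGenBench.
check_docker() {
    # Is a docker client available at all?
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found on PATH" >&2
        return 1
    fi
    # `docker info` fails when the client cannot reach the daemon.
    if ! docker info >/dev/null 2>&1; then
        echo "docker is installed but the daemon is not reachable" >&2
        return 1
    fi
    echo "docker is available"
}
```

Calling `check_docker` prints a short status line and returns a non-zero exit code if Docker is missing or not running.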

If you are working in WSL, you can follow these instructions to set up your environment:

1.  Install Docker Desktop. After installation, a restart is needed. Then, open Docker Desktop, navigate to Settings > Resources > WSL Integration, and enable integration with your WSL distro (e.g., Ubuntu).
2.  Clone the `autogen` repository and export the `AUTOGEN_REPO_BASE` environment variable. This variable ensures Docker containers use the correct agent versions.
    ```bash
    git clone git@github.com:microsoft/autogen.git
    export AUTOGEN_REPO_BASE=<path_to_autogen_clone> # Replace <path_to_autogen_clone> with the actual path
    ```
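
To catch a mistyped path early, the exported variable can be sanity-checked. This is a sketch under the assumption that a valid clone contains `python/packages/agbench` (the path used throughout this README); `check_autogen_repo_base` is a hypothetical helper, not something AutoGen provides.

```bash
# Hypothetical helper (not part of AutoGen): checks that AUTOGEN_REPO_BASE
# is set and looks like an autogen clone.
check_autogen_repo_base() {
    if [ -z "${AUTOGEN_REPO_BASE:-}" ]; then
        echo "AUTOGEN_REPO_BASE is not set" >&2
        return 1
    fi
    # Assumption: a valid clone contains the agbench package directory.
    if [ ! -d "$AUTOGEN_REPO_BASE/python/packages/agbench" ]; then
        echo "no agbench package under $AUTOGEN_REPO_BASE" >&2
        return 1
    fi
    echo "AUTOGEN_REPO_BASE looks valid: $AUTOGEN_REPO_BASE"
}
```

Run `check_autogen_repo_base` in the same shell session in which you exported the variable.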

### Install AutoGen Benchmark Tool

> [!NOTE]
> In autogen repo

```bash
git clone https://github.com/microsoft/autogen.git  # skip if you already cloned it above
cd autogen/
conda create -n autogen python=3.12
conda activate autogen
pip install -e python/packages/agbench  # path is relative to the autogen/ root
```

### Replace `scenario.py` with the Mutated Version for Cost Recording

> [!NOTE]
> In autogen repo

To enable cost recording, replace the original `scenario.py` in your `autogen` clone with the modified version from this repository.

1.  **Source:** `benchmarks/baselines/magentic_one/mutation/scenario.py` (relative to the root of the knowledge-graph-of-thoughts directory).
2.  **Destination:** `<path_to_autogen_clone>/python/packages/agbench/benchmarks/GAIA/Templates/MagenticOne/scenario.py` (replace `<path_to_autogen_clone>` with the path to your `autogen` repository clone).

This mutated script is necessary for recording token usage and costs.
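
The replacement can be done with a single `cp`. The sketch below wraps it in a hypothetical helper, `replace_scenario`, that first checks that both files exist; the two arguments (the root of the knowledge-graph-of-thoughts clone and the root of the `autogen` clone) are placeholders you must supply.

```bash
# Hypothetical helper: overwrite the stock scenario.py with the mutated one.
# $1 = root of the knowledge-graph-of-thoughts clone
# $2 = root of the autogen clone
replace_scenario() {
    src="$1/benchmarks/baselines/magentic_one/mutation/scenario.py"
    dst="$2/python/packages/agbench/benchmarks/GAIA/Templates/MagenticOne/scenario.py"
    # Refuse to run if either side is missing, so a typo cannot go unnoticed.
    [ -f "$src" ] || { echo "missing source: $src" >&2; return 1; }
    [ -f "$dst" ] || { echo "missing destination: $dst" >&2; return 1; }
    cp "$src" "$dst"
}

# Example call (both paths are placeholders):
# replace_scenario <path_to_kgot_clone> <path_to_autogen_clone>
```

Because the destination file is tracked by git in the `autogen` clone, `git checkout -- <file>` run there should restore the original if you need it back.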

### Setup AgBench

> [!NOTE]
> In autogen repo

Navigate to the GAIA benchmark directory within your `autogen` clone:
```bash
cd <path_to_autogen_clone>/python/packages/agbench/benchmarks/GAIA  # replace the placeholder with the path to your clone
```

Update `config.yaml` in this directory to point to your model host. The default model is `gpt-4o`; for this benchmark we typically use `gpt-4o-mini`.

Now, initialize the tasks. This is a three-step process:

1.  **Run the initialization script for the first time:**
    ```bash
    python Scripts/init_tasks.py
    ```
    This command attempts to download the GAIA dataset from Hugging Face. The download may require authentication; if it does not complete, or if you prefer, download the dataset manually instead.

2.  **Ensure GAIA data is present:**
    After the first script run, or if you've downloaded GAIA manually, ensure the data is correctly placed in the `./Downloads/GAIA` directory (relative to your current location, i.e., `autogen/python/packages/agbench/benchmarks/GAIA`). The expected structure is:
    ```
    .
    ├── Downloads
    │   └── GAIA
    │       └── 2023
    │           ├── test
    │           └── validation
    ├── Scripts
    │   └── init_tasks.py
    ├── Templates
    │   └── MagenticOne
    │       └── scenario.py
    └── Tasks
    ```
    *(Ensure the `Downloads/GAIA` subdirectory and its contents are populated correctly.)*

3.  **Run the initialization script again:**
    Once the GAIA data is in place, run the script again to process the data and finalize task setup:
    ```bash
    python Scripts/init_tasks.py
    ```
    Upon successful completion, a `Tasks` folder will be created in the current directory (`autogen/python/packages/agbench/benchmarks/GAIA`), containing JSONL files for the benchmark tasks (e.g., `gaia_validation_level_1__MagenticOne.jsonl`).
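
The layout described in the steps above can be checked with a short script. This is a sketch only: `check_gaia_layout` is a hypothetical helper, and it assumes exactly the directory names shown in the tree (`Downloads/GAIA/2023/test`, `Downloads/GAIA/2023/validation`, and `Tasks`).

```bash
# Hypothetical helper: run from the GAIA benchmark directory,
# or pass that directory as the first argument.
check_gaia_layout() {
    base="${1:-.}"
    # Both dataset splits must be present before init_tasks.py can process them.
    for split in test validation; do
        if [ ! -d "$base/Downloads/GAIA/2023/$split" ]; then
            echo "missing directory: $base/Downloads/GAIA/2023/$split" >&2
            return 1
        fi
    done
    # Expand the glob into the positional parameters to count task files.
    set -- "$base"/Tasks/*.jsonl
    if [ ! -e "$1" ]; then
        echo "no .jsonl task files in $base/Tasks; run Scripts/init_tasks.py again" >&2
        return 1
    fi
    echo "found $# task file(s) in $base/Tasks"
}
```

A non-zero exit code indicates which part of the setup still needs attention: the dataset download or the second run of the initialization script.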


## Quick Start

> [!NOTE]
> In autogen repo

To run a specific subset of GAIA (e.g., validation level 1 for MagenticOne):
```bash
agbench run Tasks/gaia_validation_level_1__MagenticOne.jsonl
```
Make sure you are in the `autogen/python/packages/agbench/benchmarks/GAIA` directory and the `autogen` conda environment is activated.
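
To see which other subsets are available, you can list the generated task files as ready-to-copy commands. The sketch below assumes that all MagenticOne task files share the `__MagenticOne.jsonl` suffix seen in the example filename above; `print_run_commands` is a hypothetical helper that only prints and never executes anything.

```bash
# Hypothetical helper: prints one `agbench run` command per MagenticOne task file.
# Assumption: task files end in "__MagenticOne.jsonl", as in the example above.
print_run_commands() {
    tasks_dir="${1:-Tasks}"
    for f in "$tasks_dir"/*__MagenticOne.jsonl; do
        [ -e "$f" ] || continue   # the glob matched nothing
        echo "agbench run $f"
    done
}
```

Run `print_run_commands` from the GAIA benchmark directory and copy the line for the subset you want.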


### Check the Results

> [!NOTE]
> In our repo

To analyze the benchmark results:

1.  Navigate back to the root of your `knowledge-graph-of-thoughts` repository.
2.  Deactivate the `autogen` conda environment and activate your `kgot` environment (refer to the main project `README.md` at the root of this repository for setup instructions if needed).
3.  Set the `result_path` variable in `benchmarks/baselines/magentic_one/agbench_analysis.py` to the GAIA results folder generated by AutoGenBench (typically found in `autogen/python/packages/agbench/benchmark_runs/GAIA/...`).
4.  Run the analysis script:
    ```bash
    python benchmarks/baselines/magentic_one/agbench_analysis.py
    ```

## Acknowledgement

Parts of this README are adapted from the official `autogen` repository documentation.
