# TurtleAI: Visual Programming and Reasoning Benchmark

TurtleAI is a benchmark framework for evaluating the visual programming and reasoning capabilities of models using turtle graphics. This repository provides tools for model evaluation and dataset generation.

## Table of Contents
- [TurtleAI: Visual Programming and Reasoning Benchmark](#turtleai-visual-programming-and-reasoning-benchmark)
  - [Table of Contents](#table-of-contents)
  - [Installation](#installation)
  - [Project Structure](#project-structure)
  - [Dataset](#dataset)
    - [Dataset Schema](#dataset-schema)
    - [Sample Entry](#sample-entry)
  - [Model Evaluation](#model-evaluation)
    - [Configuration](#configuration)
    - [Evaluation Pipeline](#evaluation-pipeline)
  - [Synthetic Data Generation](#synthetic-data-generation)

## Installation

```bash
pip install -r requirements.txt
```

## Project Structure

```
src/
├── exps/
│   ├── eval_vlms/      # Model evaluation scripts
│   └── datasetgen/     # Dataset generation scripts
├── turtlegfx/          # Core Turtle Graphics utilities
└── turtlegfx_datagen/  # Dataset generation utilities
```

## Dataset

The evaluation dataset consists of 823 tasks, located at:
```
src/turtlegfx/data/graphics/dataset_graphics_sz823.json
```

### Dataset Schema

| Field | Description | Values |
|-------|-------------|---------|
| `id` | Unique task identifier | String |
| `code` | Solution code | Python code |
| `task_image` | Task image | Base64 encoded PNG |
| `source` | Task origin | `midi`/`synthetic`/`handdrawn` |
| `category` | Task category | `basic_geometry`/`spiral`/`composite`/`translation`/`rotation`/`scaling` |
| `difficulty` | Difficulty level | `easy`/`medium`/`hard` |
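
As a quick sanity check, an entry can be compared against the schema above. The sketch below is a minimal illustration: the allowed values are copied from the table, while `validate_entry` is a hypothetical helper and not part of this repository.

```python
# Allowed values are copied from the schema table above.
# validate_entry is a hypothetical helper, not part of the repository.
ALLOWED = {
    "source": {"midi", "synthetic", "handdrawn"},
    "category": {
        "basic_geometry", "spiral", "composite",
        "translation", "rotation", "scaling",
    },
    "difficulty": {"easy", "medium", "hard"},
}
REQUIRED_FIELDS = ("id", "code", "task_image", "source", "category", "difficulty")


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []
    # Every field in the schema must be present.
    for field in REQUIRED_FIELDS:
        if field not in entry:
            problems.append(f"missing field: {field}")
    # Enumerated fields must hold one of the documented values.
    for field, allowed in ALLOWED.items():
        if field in entry:
            value = entry[field]
            if not isinstance(value, str) or value not in allowed:
                problems.append(f"unexpected {field}: {value!r}")
    return problems
```

Calling `validate_entry(entry)` on each task and printing any non-empty result is usually enough to spot malformed entries before running an evaluation.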

### Sample Entry
```json
{
    "id": "midi_11i",
    "code": "def draw(t):\n    t.setheading(90)...",
    "task_image": "data:image/png;base64,...",
    "source": "midi",
    "category": "composite",
    "difficulty": "medium"
}
```
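
The `task_image` field can be turned back into PNG bytes with the standard library. This is a minimal sketch under two assumptions: that the dataset file holds a JSON list of entries, and that every image uses the `data:image/png;base64,` prefix shown in the sample. Both `load_tasks` and `decode_task_image` are hypothetical helpers.

```python
import base64
import json


def decode_task_image(data_uri):
    """Return raw PNG bytes from a `data:image/png;base64,...` string."""
    # Split the data URI into its header and its base64 payload.
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:image/png") or not payload:
        raise ValueError("expected a base64 PNG data URI")
    return base64.b64decode(payload)


def load_tasks(path):
    """Load the dataset, assuming the file holds a JSON list of entries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Example usage (requires the dataset file to be present):
# tasks = load_tasks("src/turtlegfx/data/graphics/dataset_graphics_sz823.json")
# with open("task.png", "wb") as out:
#     out.write(decode_task_image(tasks[0]["task_image"]))
```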


## Model Evaluation

### Configuration
Model settings are defined in `exps/eval_vlms/scripts/configs/base_models.yaml`.

### Evaluation Pipeline

1. **Generate Prompts**
   ```bash
   bash exps/eval_vlms/scripts/build_prompts.sh
   ```
   Output location: `exps/eval_vlms/results/prompts/`

2. **Run Model Inference**
   ```bash
   sbatch exps/eval_vlms/scripts/build_responses_vlms_submit_yaml.sh
   ```
   Output location: `exps/eval_vlms/results/responses/`

   Alternatively, you can run the inference script directly:
   ```bash
   python exps/eval_vlms/build_responses_vlms.py \
        --model_name ${MODEL_NAME} \
        --prompt_file ${PROMPT_FILE} \
        --max_new_tokens ${MAX_NEW_TOKENS} \
        --do_sample \
        --top_p ${TOP_P} \
        --temperature ${TEMPERATURE} \
        --vllm_batch_size ${VLLM_MAX_NUM_SEQS} \
        --tensor_parallel_size ${TENSOR_PARALLEL_SIZE} \
        --output_path ${RESPONSE_FILE}
   ```

3. **Evaluate Results**
   ```bash
   bash exps/eval_vlms/scripts/eval_responses_vlms_yaml.sh
   ```
   Output location: `exps/eval_vlms/results/evaluations/`

   Alternatively, you can run the evaluation script directly:
   ```bash
   python src/turtlegfx/eval/eval_responses.py \
        --dataset_file "src/turtlegfx/data/graphics/dataset_graphics_sz823.json" \
        --prompt_file "${prompt_file}" \
        --response_file "${response_file}" \
        --output_file "${evaluation_file}" \
        --num_workers 8 \
        --use_embedding
   ```
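
When sweeping over several models or sampling settings, it can help to assemble the direct inference command from step 2 programmatically. The sketch below only rebuilds the flags listed in that command; `build_inference_cmd` is a hypothetical helper, and every value passed to it here is a placeholder.

```python
import shlex


def build_inference_cmd(model_name, prompt_file, output_path, *, max_new_tokens,
                        top_p, temperature, vllm_batch_size, tensor_parallel_size):
    """Assemble the argv list for build_responses_vlms.py (flags as in step 2)."""
    return [
        "python", "exps/eval_vlms/build_responses_vlms.py",
        "--model_name", model_name,
        "--prompt_file", prompt_file,
        "--max_new_tokens", str(max_new_tokens),
        "--do_sample",
        "--top_p", str(top_p),
        "--temperature", str(temperature),
        "--vllm_batch_size", str(vllm_batch_size),
        "--tensor_parallel_size", str(tensor_parallel_size),
        "--output_path", output_path,
    ]


# All values below are placeholders chosen for illustration only.
cmd = build_inference_cmd(
    "my-model",
    "prompts.json",
    "responses.json",
    max_new_tokens=1024,
    top_p=0.9,
    temperature=0.7,
    vllm_batch_size=8,
    tensor_parallel_size=1,
)
print(shlex.join(cmd))  # shell-quoted form of the command, for logging
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the script without going through a shell, which avoids quoting problems in file names.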



## Synthetic Data Generation


The data generation framework creates synthetic datasets iteratively, using a seed dataset as its starting point. Follow these steps to generate your dataset:


**Step 1: Set up the configuration file**

First, configure the generation parameters in `exps/datasetgen/scripts/configs/dataset_train.sh`:
```bash
DATE="<dataset_id>"
ITERATION="<iter_number>"
CODEGEN_MODEL_NAME="<model>"
CODESCORING_MODEL_NAME="<model>"
SEED_FILE_ITER0="<path_to_seed_file>"
```
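
Before launching a run, it may be worth confirming that none of the `<...>` placeholders were left in the config. The sketch below assumes the file contains simple `NAME="value"` assignments like those shown above; `unset_parameters` is a hypothetical helper, not part of the repository.

```python
import re

# Matches simple shell assignments of the form NAME="value".
ASSIGNMENT = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)="(.*)"\s*$')
REQUIRED = (
    "DATE",
    "ITERATION",
    "CODEGEN_MODEL_NAME",
    "CODESCORING_MODEL_NAME",
    "SEED_FILE_ITER0",
)


def unset_parameters(config_text):
    """Return required names that are missing, empty, or still `<placeholder>`."""
    values = {}
    for line in config_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            values[match.group(1)] = match.group(2)
    return [
        name for name in REQUIRED
        if not values.get(name) or re.fullmatch(r"<.*>", values[name])
    ]
```

For example, `unset_parameters(open(config_path).read())` returns an empty list once every parameter has been filled in.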


**Step 2: Initial Iteration**

Start the first iteration of data generation with this command:

```bash
bash exps/datasetgen/scripts/build_dataset.sh -c exps/datasetgen/scripts/configs/dataset_train.sh
```

This command executes three sequential steps: code mutation, deduplication, and elite selection. The output is saved to `exps/datasetgen/results/${DATE}/iter1/seed_dataset_iter1.json`.
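
This README names the three stages but does not describe how they are implemented. The toy sketch below only illustrates the general shape of a deduplicate-then-select step and is not the repository's method: `normalize`, `deduplicate`, `select_elites`, and the `score` field are all assumptions, and the mutation stage is omitted because it depends on a code generation model.

```python
def normalize(code):
    """Collapse whitespace so trivially reformatted programs compare equal."""
    return " ".join(code.split())


def deduplicate(candidates):
    """Keep the first candidate for each normalized program text."""
    seen = set()
    unique = []
    for cand in candidates:
        key = normalize(cand["code"])
        if key not in seen:
            seen.add(key)
            unique.append(cand)
    return unique


def select_elites(candidates, k):
    """Return the k highest-scoring candidates (`score` is a hypothetical field)."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
```

In this sketch, whitespace-only variants of a program collapse to one candidate before the top `k` are kept, so duplicates cannot crowd out distinct programs.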

**Step 3: Subsequent Iterations**

To run additional iterations:
1. Update the `ITERATION` parameter in the config file
2. Execute the same command:
```bash
bash exps/datasetgen/scripts/build_dataset.sh -c exps/datasetgen/scripts/configs/dataset_train.sh
```

Each new iteration builds upon the extended seed dataset from the previous run. Results are saved to `exps/datasetgen/results/${DATE}/iter${ITERATION}/seed_dataset_iter${ITERATION}.json`.
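
To locate the output of a given iteration from a script, the path pattern above can be mirrored in a small helper. This is a sketch: `seed_dataset_path` is a hypothetical function, and the path is interpreted relative to wherever `build_dataset.sh` is run from.

```python
from pathlib import PurePosixPath


def seed_dataset_path(date, iteration):
    """Mirror results/${DATE}/iter${ITERATION}/seed_dataset_iter${ITERATION}.json."""
    return (
        PurePosixPath("exps/datasetgen/results")
        / date
        / f"iter{iteration}"
        / f"seed_dataset_iter{iteration}.json"
    )
```

Here `date` corresponds to the `DATE` value and `iteration` to the `ITERATION` value in the config file.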

**Step 4: Chain-of-Thought Labeling**

Once you've completed your desired number of iterations, apply chain-of-thought labeling to the dataset:

```bash
bash exps/datasetgen/scripts/build_dataset_relabel.sh -c exps/datasetgen/scripts/configs/dataset_relabel_config.sh
```