# Coordination Transformer (CooT)

This repository contains the implementation of Coordination Transformer (CooT), a model designed to coordinate with unseen biased agents using only in-context trajectories and query states.

## Create Conda Environment
```bash
conda create -n coot python=3.9
conda activate coot
```

## Setup

```bash
# Install dependencies
pip install -r requirements.txt
cd zsceval
pip install -e .
```

# CooT Training and Evaluation Pipeline

This document explains the complete pipeline for training and evaluating CooT (Coordination Transformer) agents in the Overcooked environment.

## Overview

The pipeline consists of four main steps:
1. **Generate Rollouts**: Create training data by running biased agents and their best responses
2. **Collect Data**: Process rollouts into training datasets
3. **Train Model**: Train the CooT model using the collected data
4. **Evaluate Model**: Test the trained model's performance

## Step 1: Generate Rollouts

### Purpose
Roll out biased agents together with their best responses (BRs). The resulting trajectory files are the raw material for the training dataset built in Step 2.

### Command Line Usage

```bash
./render_overcooked.sh [LAYOUT] [ROLLOUTS] [SEED_TYPE] [SKILL_LEVEL] [NOISE_TYPE] [AGENT_TYPE]
```

#### Parameters:
- **LAYOUT** (default: "random1"): The Overcooked layout to use
  - Options: `random1`, `random0`, `random0_medium`, `random3`, `random1_m`, `random0_m`
- **ROLLOUTS** (default: "200"): Number of rollout episodes to generate
- **SEED_TYPE** (default: "train"): Type of seed data
  - Options: `train`, `eval`, `mep`
- **SKILL_LEVEL** (default: "final"): Skill level of the agents
  - Options: `final`, `mid`
- **NOISE_TYPE** (default: "none"): Type of noise to add
  - Options: `none`, `mid`, `small`
- **AGENT_TYPE** (default: "hsp"): Type of agent to use
  - Options: `hsp`, `mep`

#### Examples:
```bash
# Generate 200 rollouts for random1 layout with mid skill-level HSP agents 
./render_overcooked.sh random1 200 train mid none hsp

# Generate 100 rollouts for random0_medium with final skill-level MEP agents and small noise
./render_overcooked.sh random0_medium 100 mep final small mep
```
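If you launch rollouts from Python rather than the shell, the positional arguments can be validated before the script runs. The sketch below is only an illustration: `build_render_command` and `RENDER_OPTIONS` are hypothetical names that are not part of this repository, and the allowed values are copied from the parameter list above.

```python
from typing import List

# Allowed values, copied from the parameter list above.
RENDER_OPTIONS = {
    "layout": ["random1", "random0", "random0_medium", "random3",
               "random1_m", "random0_m"],
    "seed_type": ["train", "eval", "mep"],
    "skill_level": ["final", "mid"],
    "noise_type": ["none", "mid", "small"],
    "agent_type": ["hsp", "mep"],
}


def build_render_command(layout: str = "random1", rollouts: int = 200,
                         seed_type: str = "train", skill_level: str = "final",
                         noise_type: str = "none",
                         agent_type: str = "hsp") -> List[str]:
    """Return the argv list for render_overcooked.sh (hypothetical helper)."""
    chosen = {
        "layout": layout,
        "seed_type": seed_type,
        "skill_level": skill_level,
        "noise_type": noise_type,
        "agent_type": agent_type,
    }
    for name, value in chosen.items():
        if value not in RENDER_OPTIONS[name]:
            raise ValueError(f"{name}={value!r} is not one of {RENDER_OPTIONS[name]}")
    if rollouts <= 0:
        raise ValueError("rollouts must be a positive integer")
    # Positional order: LAYOUT ROLLOUTS SEED_TYPE SKILL_LEVEL NOISE_TYPE AGENT_TYPE
    return ["./render_overcooked.sh", layout, str(rollouts),
            seed_type, skill_level, noise_type, agent_type]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the script.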

### Important Script Arguments

The `render_overcooked.sh` script calls `render_overcooked.py` with several important parameters:

#### Environment Configuration:
- **`--overcooked_version`**: Specifies the Overcooked version based on layout
  - `old`: For layouts `random1`, `random0`, `random0_medium`, `random3`
  - `new`: For layouts `random1_m`, `random0_m`
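
The layout-to-version rule above can be written as a small lookup. This is a sketch of the rule as documented here, not the script's actual code; `overcooked_version` is a hypothetical helper name.

```python
# Layout groups as listed above.
OLD_LAYOUTS = {"random1", "random0", "random0_medium", "random3"}
NEW_LAYOUTS = {"random1_m", "random0_m"}


def overcooked_version(layout: str) -> str:
    """Return the value to pass as --overcooked_version for a layout."""
    if layout in OLD_LAYOUTS:
        return "old"
    if layout in NEW_LAYOUTS:
        return "new"
    raise ValueError(f"unknown layout: {layout!r}")
```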

#### Weight Parameters:
- **`--w0`**: Weight vector for agent 0 (biased agent)
- **`--w1`**: Weight vector for agent 1 (best response agent)
  - These weights control the reward shaping and behavior of agents
  - Different layouts have different weight configurations

#### Training Configuration:
- **`--use_wandb`**: Enable Weights & Biases logging for experiment tracking
- **`--model_seed_start`** and **`--model_seed_end`**: Range of model seeds to use
  - Default: seeds 16-20 (5 different models)
- **`--store_traj`**: Save trajectory data to files
- **`--rollout_episodes`**: Number of episodes to run per rollout
- **`--episode_length`**: Maximum length of each episode (default: 200)
- **`--use_render`**: Render the trajectory to a GIF file

#### Agent Configuration:
- **`--use_hsp`**: Use HSP (Hidden-utility Self-Play) agents
- **`--share_policy`**: Share policy between agents
- **`--use_recurrent_policy`**: Use recurrent neural network policies

## Step 2: Collect Data

### Purpose
Process the generated rollouts into structured training datasets. This step extracts context windows, query states, and optimal actions from the trajectory files.

### Command Line Usage

```bash
./collect_overcooked.sh
```

The script is configured to run data collection for specific agent ranges. You can modify the `starts` and `ends` arrays in the script to change which agents to process.

### Important Script Arguments

The `collect_overcooked.sh` script calls `collect_data_improve.py` with these key parameters:

#### Data Generation Parameters:
- **`--hists`**: Number of history contexts to generate per agent (default: 125)
- **`--samples`**: Number of query samples to generate per history (default: 70)
- **`--agent_id_start`** and **`--agent_id_end`**: Range of agent IDs to process
- **`--episode_length`**: Length of episodes in the dataset (default: 200)
- **`--dataset_prefix`**: Prefix for the output dataset files

#### Agent Selection:
- **`--rollin_type`**: Type of rollout to use (default: "expert")
- **`--layout_name`**: Overcooked layout name

### Global Variables to Modify

Before running data collection, you need to modify several global variables in `collect_data.py`:

#### Path Configuration:
```python
STORAGE_PREFIX = 'prefix_to_where_collect_data_py_is_stored' 
```

#### Data Generation Parameters:
```python
CTX_ROLLOUTS = 5        # Number of context rollouts per history
MASK_ROLLOUT = True     # Whether to mask some rollouts with zeros
NUM_QUERY = 6           # Number of query states per sample
```

#### Important Notes:
- **`CTX_ROLLOUTS`**: Controls how many context trajectories are used per history
- **`MASK_ROLLOUT`**: When enabled, some rollouts are replaced with zero tensors to improve generalization
- **`NUM_QUERY`**: Number of query states that the model needs to predict actions for
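
To make the effect of `MASK_ROLLOUT` concrete, here is a minimal sketch of replacing some context rollouts with zeros. It is an assumption-heavy illustration: this README does not describe how the collection script picks which rollouts to mask or how many, so `mask_rollouts`, its `num_masked` argument, and the uniform random choice are all hypothetical. Plain lists stand in for tensors.

```python
import random
from typing import List, Optional

Rollout = List[List[float]]  # one rollout = a list of per-step feature vectors


def mask_rollouts(rollouts: List[Rollout], num_masked: int,
                  seed: Optional[int] = None) -> List[Rollout]:
    """Return a copy in which `num_masked` randomly chosen rollouts are all zeros."""
    rng = random.Random(seed)
    masked_ids = set(rng.sample(range(len(rollouts)), num_masked))
    result = []
    for index, rollout in enumerate(rollouts):
        if index in masked_ids:
            # Keep the shape, drop the content.
            result.append([[0.0] * len(step) for step in rollout])
        else:
            result.append([list(step) for step in rollout])
    return result
```

With `CTX_ROLLOUTS = 5`, calling `mask_rollouts(context, num_masked=2)` would leave three informative rollouts and two zeroed ones of the same shape.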

### Output Files

The data collection process generates four separate pickle files:
- **`_query_s.pkl`**: Query states for prediction
- **`_optimal_a.pkl`**: Optimal actions corresponding to query states
- **`_context_s.pkl`**: Context states from rollouts
- **`_context_a.pkl`**: Context actions from rollouts
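
A quick way to inspect the output is to load the four files into one dictionary. The sketch below assumes each file is named `{dataset_prefix}` followed by the suffix listed above and sits in a single directory; `load_dataset` is a hypothetical helper, not a function from this repository.

```python
import pickle
from pathlib import Path
from typing import Any, Dict

# Suffixes from the list above, without the leading underscore.
DATASET_SUFFIXES = ("query_s", "optimal_a", "context_s", "context_a")


def load_dataset(prefix: str, directory: str = ".") -> Dict[str, Any]:
    """Load the four pickle files that share `prefix` (assumed naming scheme)."""
    data = {}
    for suffix in DATASET_SUFFIXES:
        path = Path(directory) / f"{prefix}_{suffix}.pkl"
        with path.open("rb") as handle:
            # Only unpickle files you generated yourself: pickle can run code.
            data[suffix] = pickle.load(handle)
    return data
```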

### Path Configuration

The script automatically handles different layouts and agent types, but you may need to verify these paths exist:

#### HSP Agent Paths:
- `hsp_final_none_150/`: Final skill level, no noise, 150 trajectories
- `hsp_final_small_150/`: Final skill level, small noise, 150 trajectories
- `hsp_final_mid_150/`: Final skill level, medium noise, 150 trajectories
- `hsp_mid_none_150/`: Mid skill level, no noise, 150 trajectories

#### MEP Agent Paths:
- `mep_final_none_150/`: Final skill level, no noise, 150 trajectories
- `mep_mid_none_150/`: Mid skill level, no noise, 150 trajectories

#### Evaluation Paths:
- `hsp_eval_none_50/`: Evaluation data, no noise, 50 trajectories
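
The directory names above appear to follow an `{agent}_{variant}_{noise}_{count}` pattern. The sketch below encodes that reading and reports which expected directories are missing. Both the pattern and the helper names (`rollout_dir`, `missing_dirs`) are assumptions for illustration; the root directory depends on your `STORAGE_PREFIX`.

```python
from pathlib import Path
from typing import Iterable, List

# Directory names listed above.
EXPECTED_DIRS = [
    "hsp_final_none_150", "hsp_final_small_150", "hsp_final_mid_150",
    "hsp_mid_none_150", "mep_final_none_150", "mep_mid_none_150",
    "hsp_eval_none_50",
]


def rollout_dir(agent_type: str, variant: str, noise_type: str,
                num_trajectories: int) -> str:
    """Build a directory name; `variant` is a skill level or "eval"."""
    return f"{agent_type}_{variant}_{noise_type}_{num_trajectories}"


def missing_dirs(root: str, names: Iterable[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected directory names that do not exist under `root`."""
    return [name for name in names if not (Path(root) / name).is_dir()]
```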

### Agent Grouping

The script automatically groups agents based on their IDs:
- **Bias Agents**: First set of agents (HSP agents with different skill levels)
- **MEP Agents**: Second set of agents (MEP agents for comparison)
- **Test Agents**: Separate set of agents for testing

The exact agent IDs for each group are defined in the script for each layout.

## Step 3: Train Model

### Purpose
Train the CooT model on the collected dataset. This step trains a transformer to predict optimal actions given context states, context actions, and query states.

### Command Line Usage

```bash
./train_overcooked.sh
```

The script is configured to train with specific hyperparameters. You can modify the training command in the script to adjust model architecture and training parameters.

### Important Training Arguments

The `train_overcooked.sh` script calls `train.py` with these key parameters:

#### Model Architecture:
- **`--agents`**: Number of agents in the dataset (default: 36)
- **`--hists`**: Number of history contexts per agent (default: 125)
- **`--samples`**: Number of query samples per history (default: 70)
- **`--H`**: Horizon length for training (default: 200)
- **`--layer`**: Number of transformer layers (default: 4)
- **`--head`**: Number of attention heads (default: 2)
- **`--embd`**: Embedding dimension (default: 128)

#### Training Configuration:
- **`--lr`**: Learning rate (default: 0.00005)
- **`--num_epochs`**: Total number of training epochs (default: 70)
- **`--batch_size`**: Training batch size (default: 100)
- **`--patience`**: Early stopping patience (default: 25)
- **`--grad_clip`**: Gradient clipping norm (default: 0.25)
- **`--wd`**: Weight decay for regularization (default: 0.001)

#### Anti-Overfitting Measures:
- **`--dropout`**: Dropout rate (default: 0.3)
- **`--label_smoothing`**: Label smoothing factor (default: 0.0)
- **`--use_step_masking`**: Enable masking of specific steps in episodes
- **`--mask_steps_per_episode`**: Number of steps to mask per episode (default: 70)
- **`--use_curriculum_masking`**: Enable curriculum masking
- **`--mask_schedule`**: Masking schedule type (default: "logarithmic")
  - Options: `linear`, `exponential`, `logarithmic`
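
The three schedule names suggest different ramps for how many steps are masked as training progresses. The sketch below is one plausible interpretation, not the formula used in `train.py`: it assumes masking grows from 0 to `mask_steps_per_episode` over the run, and the function name `masked_steps` and the steepness constant `k` are hypothetical.

```python
import math


def masked_steps(epoch: int, num_epochs: int, max_steps: int = 70,
                 schedule: str = "logarithmic", k: float = 9.0) -> int:
    """Number of steps to mask at `epoch` under an assumed curriculum ramp."""
    if num_epochs <= 1:
        progress = 1.0
    else:
        progress = min(max(epoch / (num_epochs - 1), 0.0), 1.0)

    if schedule == "linear":
        fraction = progress
    elif schedule == "exponential":
        # Slow start, fast finish.
        fraction = math.expm1(k * progress) / math.expm1(k)
    elif schedule == "logarithmic":
        # Fast start, slow finish.
        fraction = math.log1p(k * progress) / math.log1p(k)
    else:
        raise ValueError(f"unknown schedule: {schedule!r}")
    return round(max_steps * fraction)
```

Under these assumptions, all three schedules mask nothing in the first epoch and `max_steps` in the last; they differ only in how quickly they get there.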

#### Data and Model Management:
- **`--dataset_prefix`**: Prefix for the dataset files
- **`--model_subdir`**: Subdirectory to save trained models
- **`--shuffle`**: Enable dataset shuffling during training
- **`--transformer`**: Transformer model type (default: "gpt2")
- **`--num_query`**: Number of query states per sample (default: 6)
- **`--increment`**: Increment for multi-dataset training (default: 6)

#### Examples:
```bash
# Train with custom parameters
python3 train.py --env overcooked --layout_name random1 --agents 36 \
  --hists 140 --samples 70 --H 200 --lr 0.00005 --layer 4 --head 2 \
  --dropout 0.3 --embd 128 --num_epochs 70 --batch_size 100 \
  --dataset_prefix my_dataset --model_subdir my_model \
  --use_step_masking --mask_steps_per_episode 70 \
  --num_query 6 --use_curriculum_masking --mask_schedule logarithmic \
  --transformer gpt2 --shuffle --rollin_type expert --seed 1 --wandb
```

### Model Architecture Customization

#### Modifying `net.py`:
The transformer architecture is defined in `net.py`. You can modify:

- **Transformer type**: Change between `gpt2` and `qwen2` models
- **Architecture parameters**: Adjust embedding dimensions, layers, and heads
- **Input processing**: Modify how states, actions, and rewards are embedded

#### Modifying `dataset.py`:
The dataset handling is defined in `dataset.py`. You can modify:

- **Data loading**: Change how trajectory data is processed
- **Shuffling**: Adjust the shuffling strategy for context sequences
- **Data augmentation**: Add custom data augmentation techniques

### Path Configuration

Before training, ensure these paths are correctly configured:

1. **Dataset paths**: Verify that `dataset_prefix` matches your collected data files
2. **Model save directory**: Check that `model_subdir` creates the desired directory structure
3. **Output directories**: Ensure `models/` and `figs/loss/` directories exist

The model will be saved as:
- **Best model**: `models/{model_subdir}/{filename}_best.pt`
- **Checkpoints**: `models/{model_subdir}/{filename}_epoch{N}.pt`
- **Final model**: `models/{filename}.pt`
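
The path templates above can be assembled, and the output directories created, with a few lines of `pathlib`. This is a sketch under the assumption that all paths are relative to the working directory; `model_paths` and `ensure_output_dirs` are hypothetical helpers, and `filename` is whatever name the training script derives from its arguments.

```python
from pathlib import Path
from typing import Dict


def model_paths(model_subdir: str, filename: str, epoch: int) -> Dict[str, Path]:
    """Return the three save locations described above for one run."""
    root = Path("models")
    return {
        "best": root / model_subdir / f"{filename}_best.pt",
        "checkpoint": root / model_subdir / f"{filename}_epoch{epoch}.pt",
        "final": root / f"{filename}.pt",
    }


def ensure_output_dirs(base: str, model_subdir: str) -> None:
    """Create models/{model_subdir}/ and figs/loss/ under `base` if missing."""
    for relative in (Path("models") / model_subdir, Path("figs") / "loss"):
        (Path(base) / relative).mkdir(parents=True, exist_ok=True)
```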

### Training Process

The training process includes:

1. **Data Loading**: Load training and test datasets
2. **Model Initialization**: Create transformer model with specified architecture
3. **Training Loop**: Iterate through epochs with early stopping
4. **Step Masking**: Apply curriculum masking to improve generalization
5. **Evaluation**: Monitor test loss for model selection
6. **Model Saving**: Save best model and periodic checkpoints
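
The interaction between early stopping and best-model selection in steps 3, 5, and 6 can be sketched without any deep-learning code. The function below is a generic patience-based loop over a sequence of test losses, offered as an illustration rather than the logic in `train.py`; `run_early_stopping` is a hypothetical name.

```python
from typing import Iterable, Tuple


def run_early_stopping(test_losses: Iterable[float],
                       patience: int = 25) -> Tuple[int, float, int]:
    """Return (best_epoch, best_loss, last_epoch_run) for a stream of test losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    last_epoch = -1
    for epoch, loss in enumerate(test_losses):
        last_epoch = epoch
        if loss < best_loss:
            # A real loop would save the "*_best.pt" checkpoint here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss, last_epoch
```

With `--num_epochs 70` and `--patience 25`, a run stops early only if the test loss fails to improve for 25 consecutive epochs.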

## Step 4: Evaluate Model

### Purpose
Evaluate the trained CooT model's performance in the Overcooked environment. This step tests how well the model can adapt to new contexts and generate optimal actions.

### Command Line Usage

```bash
./eval_overcooked.sh
```

The script evaluates the model on different agent seeds and reports performance metrics.

### Important Evaluation Arguments

The `eval_overcooked.sh` script calls `eval_overcooked.py` with these key parameters:

#### Model Configuration:
- **`--agents`**: Number of agents to evaluate (default: 36)
- **`--hists`**: Number of history contexts (default: 140)
- **`--samples`**: Number of query samples (default: 70)
- **`--H`**: Context horizon length (default: 200)
- **`--epoch`**: Which training epoch to evaluate (default: 20)
- **`--model_subdir`**: Directory containing the trained model

#### Evaluation Settings:
- **`--hor`**: Episode horizon for evaluation (default: 200)
- **`--n_eval`**: Number of evaluation runs (default: 1)
- **`--batch_size`**: Evaluation batch size (default: 100)
- **`--use_step_masking`**: Use the same masking as during training
- **`--mask_steps_per_episode`**: Steps to mask per episode (default: 70)

#### Model Architecture (must match training):
- **`--layer`**: Number of transformer layers (default: 4)
- **`--head`**: Number of attention heads (default: 2)
- **`--embd`**: Embedding dimension (default: 128)
- **`--dropout`**: Dropout rate (default: 0.3)
- **`--transformer`**: Transformer model type (default: "gpt2")
- **`--num_query`**: Number of query states (default: 6)
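
Because these values must match training, it can help to compare the two argument sets before loading a checkpoint. The sketch below assumes you have both configurations as plain dictionaries keyed by flag name; `architecture_mismatches` is a hypothetical helper, not part of `eval_overcooked.py`.

```python
from typing import Any, Dict, Tuple

# Flags listed above as "must match training".
ARCH_KEYS = ("layer", "head", "embd", "dropout", "transformer", "num_query")


def architecture_mismatches(train_args: Dict[str, Any],
                            eval_args: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing flag to its (training value, evaluation value)."""
    mismatches = {}
    for key in ARCH_KEYS:
        train_value, eval_value = train_args.get(key), eval_args.get(key)
        if train_value != eval_value:
            mismatches[key] = (train_value, eval_value)
    return mismatches
```

An empty result means the architecture flags agree; any entry names a flag to fix before evaluating.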

### Global Variables to Modify

Before running evaluation, you need to modify several global variables in `eval_overcooked.py`:

#### Path Configuration:
```python
STORAGE_PREFIX = 'prefix_to_where_collect_data_py_is_stored' 
```

#### Evaluation Parameters:
```python
CTX_ROLLOUTS = 5        # Number of context rollouts
SEED_TYPE = "test"      # Type of seeds to evaluate: "train", "test", or "mep"
EPISODE_LENGTH = 200    # Length of evaluation episodes
HEPS = 50               # Number of episodes to run
GROUP_PREFIX = ""       # WandB group prefix
SKILL_LEVEL = "final"   # Skill level: "mid" or "final"
```

#### Model Paths:
```python
# Paths to biased agent models (must match training configuration)
model_path0 = f"/path/to/hsp{model_seed}_{SKILL_LEVEL}_w0_actor.pt"
model_path1 = f"/path/to/hsp{model_seed}_{SKILL_LEVEL}_w1_actor.pt"
```

#### Seed Configuration:
You can modify the `seed_range` for different layouts to test on different sets of agents:

```python
if layout_name == "random1":
    if SEED_TYPE == "train":
        seed_range = [1, 3, 5, 9, 11, 13, 15, 16, 17, 19]
    elif SEED_TYPE == "test":
        seed_range = [2, 8, 12, 15, 16, 17, 20, 27, 31, 50]
```
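
When editing `seed_range`, it may be worth checking whether the train and test lists share seeds. The helper below is a hypothetical convenience, not code from the repository. Applied to the `random1` lists shown above, it reports that seeds 15, 16, and 17 appear in both; whether that overlap is intended is not stated here.

```python
from typing import Iterable, List


def seed_overlap(train_seeds: Iterable[int], test_seeds: Iterable[int]) -> List[int]:
    """Return the sorted seeds that appear in both lists."""
    return sorted(set(train_seeds) & set(test_seeds))


# The random1 lists from the snippet above.
RANDOM1_TRAIN = [1, 3, 5, 9, 11, 13, 15, 16, 17, 19]
RANDOM1_TEST = [2, 8, 12, 15, 16, 17, 20, 27, 31, 50]
```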

### Evaluation Process

The evaluation process includes:

1. **Model Loading**: Load the trained CooT model
2. **Environment Setup**: Create Overcooked environment with specified layout
3. **Best Response Calculation**: Compute optimal performance baseline
4. **Online Evaluation**: Test model adaptation over multiple episodes
5. **Performance Metrics**: Calculate cumulative rewards and adaptation curves
6. **Visualization**: Generate performance plots and save results

### Output and Results

The evaluation generates:

- **Performance metrics**: Average rewards and adaptation curves
- **Visualization plots**: Performance over episodes 
- **WandB logging**: If enabled, logs results to Weights & Biases
- **Console output**: Detailed performance statistics for each seed
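
As an illustration of the first bullet, the sketch below averages per-episode returns across seeds to produce an adaptation curve, and averages across episodes to get one number per seed. The input layout (a dictionary from seed to a list of episode returns) and the function names are assumptions, not the data structures used by `eval_overcooked.py`.

```python
from statistics import mean
from typing import Dict, List


def adaptation_curve(returns_by_seed: Dict[int, List[float]]) -> List[float]:
    """Mean return at each episode index, averaged over seeds."""
    lengths = {len(returns) for returns in returns_by_seed.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one seed, all with the same episode count")
    return [mean(episode) for episode in zip(*returns_by_seed.values())]


def average_reward(returns_by_seed: Dict[int, List[float]]) -> Dict[int, float]:
    """Mean return over all episodes, reported per seed."""
    return {seed: mean(returns) for seed, returns in returns_by_seed.items()}
```

A rising curve indicates that the model benefits from the accumulated context as more episodes are played with the same partner.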

### Important Notes

1. **Model Compatibility**: Ensure evaluation parameters match training parameters exactly
2. **Path Verification**: Verify all model and data paths exist before running
3. **Resource Requirements**: Evaluation can be computationally intensive; adjust batch sizes as needed
4. **Seed Selection**: Choose appropriate seed ranges based on your evaluation goals
5. **Performance Baseline**: The evaluation compares against best response agents to measure adaptation quality

## Complete Pipeline Summary

The complete CooT training and evaluation pipeline:

1. **Generate Rollouts** → Create trajectory data with biased agents
2. **Collect Data** → Process rollouts into training datasets  
3. **Train Model** → Train CooT transformer model
4. **Evaluate Model** → Test model performance and adaptation

Each step builds on the previous one; together they form the full workflow for training and evaluating CooT agents in the Overcooked environment.