## ObsScaling analysis: correlations, scaling laws, validation

### What this is
Utilities to analyze embodied-agent performance vs base LLM capability. Three analysis areas:
- exp_cor: cross-dataset correlations and LaTeX tables
- exp_scale: scaling-law style plots (vs FLOPs, model size, PCA PCs, masking/IT comparisons)
- exp_valid: validation and side-by-side comparisons across datasets/axes

Plots are written under `plots/` and subfolders; open the saved PNGs to view results.

### Inputs: what’s in `eval_results/`
The scripts read CSVs of evaluation results, some of them joined with compute (FLOPs) and OpenLLM benchmark metrics:
- virtualhome_action_sequencing_results*.csv and behavior_action_sequencing_results*.csv: task- and error-rate metrics for action sequencing
- virtualhome_goal_interpretation*_results*.csv and behavior_goal_interpretation*_results*.csv: precision/recall/F1 metrics for goal interpretation
- "_with_flops" and "_with_flops_and_openllm" variants: augmented with FLOPs and OpenLLM benchmark columns (e.g., `Average`, `BBH`, `MATH Lvl 5`, `GPQA`, `MUSR`, `MMLU-PRO`, `IFEval`)
- other/…: helper tables (e.g., `livebench_image_name_mapping_v4.csv`, EAI aggregate tables)

Most analysis scripts expect the “with_flops_and_openllm” CSVs.
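Before running an analysis, it can help to confirm that a CSV actually carries the OpenLLM columns listed above. The following is a minimal sketch using only the standard library; the helper name `missing_openllm_columns` is hypothetical and not part of this repo, and it assumes the column names appear in the header exactly as written in this README.

```python
import csv
from pathlib import Path

# OpenLLM benchmark columns named in this README (assumed to match the CSV headers exactly).
OPENLLM_COLUMNS = ["Average", "BBH", "MATH Lvl 5", "GPQA", "MUSR", "MMLU-PRO", "IFEval"]


def missing_openllm_columns(csv_path):
    """Return the OpenLLM columns that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in OPENLLM_COLUMNS if name not in present]


if __name__ == "__main__":
    # Assumes the script is run from the ObsScaling directory.
    for path in sorted(Path("eval_results").glob("*_with_flops_and_openllm.csv")):
        missing = missing_openllm_columns(path)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{path.name}: {status}")
```

The check only inspects the header row; it does not validate the FLOPs column, whose name is not documented here.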

### Setup
From this directory (`supple/ObsScaling`):

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Note: a few scripts contain absolute paths pointing to a different local checkout (e.g., `/Users/qinjielin/Downloads/NWU/25corl/corl_ws/ObsScaling`). If a script fails to read data, update those paths to this repo’s path or run from that checkout.
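To locate the affected scripts, one option is to scan the Python files for the old prefix. This is a minimal sketch, assuming the prefix quoted in the note above is the one used in the scripts; `find_hardcoded_paths` is a hypothetical helper, not something shipped with this repo.

```python
from pathlib import Path

# Prefix taken from the note above; change it if your scripts reference another checkout.
OLD_PREFIX = "/Users/qinjielin/Downloads/NWU/25corl/corl_ws/ObsScaling"


def find_hardcoded_paths(root, prefix=OLD_PREFIX):
    """Return (file, line_number, line) for every line in *.py files under root that contains prefix."""
    hits = []
    for script in sorted(Path(root).rglob("*.py")):
        text = script.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if prefix in line:
                hits.append((str(script), number, line.strip()))
    return hits
```

For example, `find_hardcoded_paths("exp_scale")` lists the matching lines in that folder. Scanning `.` also walks `.venv` and will match this snippet itself if it is saved inside the tree.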

### (Optional) Preprocess to add FLOPs and OpenLLM columns

Helper scripts for this step live in `exp_prepoc` (not in `exp_cor`/`exp_scale`/`exp_valid`):

```bash
python exp_prepoc/convert_result_to_csv.py --dataset virtualhome --eval-type action_sequencing_v4
python exp_prepoc/add_flops.py
python exp_prepoc/add_openllm.py
```

Run these if you need to regenerate the “_with_flops[_and_openllm]” CSVs.

### Correlation analyses (exp_cor)

Key scripts and outputs (images saved to `plots/correlation/`):
- `exp_cor/calculate_cor_openllm.py`: builds a hierarchical correlation heatmap across simulations/tasks and writes `openllm_correlation_heatmap_hierarchical.png`; also creates `plots/correlation/all_correlation_tables.tex` with per-task LaTeX tables.
- `exp_cor/calculate_cor_openllm_v0.py`: multiple side-by-side heatmaps, saved as `openllm_correlation_heatmap_multiple.png`.
- `exp_cor/calculate_correlation_eai.py` and `exp_cor/calculate_cor_eai_action_seq.py`: two-panel heatmaps splitting VirtualHome vs Behavior; outputs like `correlation_heatmap_eai_separated.png` or `correlation_heatmap_eai_action_seq.png`.

Run examples:

```bash
python exp_cor/calculate_cor_openllm.py
python exp_cor/calculate_cor_openllm_v0.py
python exp_cor/calculate_correlation_eai.py
```

Open the PNGs under `plots/correlation/` to view the analysis.
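For a quick spot check of a single heatmap cell, a correlation between one task metric and one benchmark column can be computed by hand. The sketch below uses a plain Pearson coefficient as an assumption; the scripts may use a different correlation measure or preprocessing. The function names are hypothetical, and any task-metric column name you pass in must be read from your own CSV header.

```python
import csv
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def column_correlation(csv_path, task_column, benchmark_column):
    """Correlate two CSV columns, skipping rows where either value is missing or non-numeric."""
    xs, ys = [], []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            try:
                x, y = float(row[task_column]), float(row[benchmark_column])
            except (KeyError, TypeError, ValueError):
                continue
            xs.append(x)
            ys.append(y)
    return pearson(xs, ys)
```

`pearson` raises `ZeroDivisionError` when either column is constant or empty, so filter such columns first.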

### Scaling-law plots (exp_scale)

These scripts generate scaling figures into `plots/…` subfolders. Common inputs are `*_results_with_flops_and_openllm.csv`.

Typical entry points:
- FLOPs/size/PCA scaling for action sequencing (VirtualHome):
  - `exp_scale/plot_law_as_flops_obs.py`
  - `exp_scale/plot_law_as_flops.py`
  - `exp_scale/obs_pca_scaling.py`
- Goal interpretation scaling: `exp_scale/plot_law_gi_flops.py`, `exp_scale/plot_law_gi_flops_obs.py`
- Masking vs baseline comparisons: `exp_scale/plot_law_as_masked.py`, `exp_scale/plot_law_as_masked_obs.py`, and the goal-interpretation analogs `plot_law_gi_masked.py`, `plot_law_gi_masked_obs.py`
- Base vs instruction-tuned comparisons: `exp_scale/plot_law_as_it.py`
- Cross-simulation gap (VirtualHome vs Behavior): `exp_scale/plot_law_as_vh_bh.py`
- Family-filtered bar plots: `exp_scale/plot_bar_v1.py`, `exp_scale/plot_bar_mask.py`
- Generic scatter “law” figure: `exp_scale/plot_law.py` (uses CLI `--input`/`--output-dir`)

Run examples:

```bash
python exp_scale/plot_law_as_flops_obs.py
python exp_scale/plot_law_gi_flops_obs.py
python exp_scale/plot_law_as_masked_obs.py
python exp_scale/plot_law_as_vh_bh.py
python exp_scale/plot_law.py --input ./eval_results/virtualhome_action_sequencing_results_with_flops_and_openllm.csv --output-dir plots
```

Open the saved images under `plots/virtualhome/...`, `plots/behavior/...`, `plots/slides`, or `plots/sim_gap/...` depending on the script.
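To give a sense of what a scaling fit against FLOPs involves, here is a minimal sketch of a least-squares fit of a score against `log10(FLOPs)`. The log-linear form is an assumption made for illustration; the scripts in `exp_scale` may fit a different functional form. `fit_log_linear` is a hypothetical name, and the numbers are synthetic, not results from this repo.

```python
import math


def fit_log_linear(flops, scores):
    """Least-squares fit of score = intercept + slope * log10(flops)."""
    xs = [math.log10(f) for f in flops]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic example values, not results from this repo.
flops = [1e21, 1e22, 1e23, 1e24]
scores = [0.10, 0.25, 0.40, 0.55]
slope, intercept = fit_log_linear(flops, scores)
print(f"score ~ {intercept:.2f} + {slope:.2f} * log10(FLOPs)")
```

A straight line in log-FLOPs is the simplest choice; bounded metrics such as success rates often call for a saturating link instead.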

### Validation (exp_valid)

Sanity checks and combined views. Outputs are saved under `plots/validate/{action_sequencing|goal_interpretation}/`.

Run examples:

```bash
python exp_valid/validate_single.py
python exp_valid/validate_together.py
python exp_valid/validate_together_modular.py
```

The modular version (`validate_together_modular.py`) cleanly merges datasets, converts metrics to rates, configures PCA/no-PCA variants, and writes plot files with a hash-based filename plus a mapping file.
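The hash-based naming idea can be sketched as follows: derive a short digest from the plot configuration, use it in the filename, and record which configuration each filename corresponds to. This is an illustration under assumptions, not the script's actual code; the function names, the SHA-256 choice, the 12-character digest, and the JSON mapping format are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def hashed_plot_name(config, prefix="plot", ext=".png"):
    """Derive a short, deterministic filename from a plot configuration dict."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not affect the hash
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}{ext}"


def record_mapping(mapping_path, filename, config):
    """Add filename -> config to a JSON mapping file, creating the file if needed."""
    path = Path(mapping_path)
    mapping = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    mapping[filename] = config
    path.write_text(json.dumps(mapping, indent=2, sort_keys=True), encoding="utf-8")
    return mapping
```

Hashing keeps filenames short when many configuration options are combined, and the mapping file is what makes a given PNG traceable back to its settings.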

### Where to look for results

- Input data: `supple/ObsScaling/eval_results/` (CSV inputs used by all scripts)
- Correlations: `supple/ObsScaling/plots/correlation/*.png` and `all_correlation_tables.tex`
- Scaling: `supple/ObsScaling/plots/**` (per-dataset/task subfolders)
- Validation: `supple/ObsScaling/plots/validate/**`

If a script produces no plots, check that the `_with_flops_and_openllm` CSVs exist in `eval_results/` and contain the FLOPs and OpenLLM columns (rerun the preprocessing steps if not), and update any absolute paths in the scripts to point to this checkout.


