# Group-Level Evaluation Pipeline

## Steps in the Pipeline

1. [`preprocessing.py --mode depth`](#1-preprocessingpy)
2. [`base_model_analysis.py`](#2-base_model_analysispy)
3. [`paired_ttest.py`](#3-paired_ttestpy)
4. [`plot_base_stats.py`](#4-plot_base_statspy)
5. [`plot_base_transition_mat.py`](#5-plot_base_transition_matpy)
6. [`generate_base_model_summary.py`](#6-generate_base_model_summarypy)
7. [`draw_scatter_plots.py`](#7-draw_scatter_plotspy)
8. [`generate_scatter_plot_summary.py`](#8-generate_scatter_plot_summarypy)
9. [`draw_histogram.py`](#9-draw_histogrampy)
10. [`generate_histogram_summary.py`](#10-generate_histogram_summarypy)

## Usage

```
python run_pipeline.py
```

### Options

- `--dry-run`: Print all steps without executing them.
```
python run_pipeline.py --dry-run
```

- `--from-step N`: Start from step `N` (1-based index).
```
python run_pipeline.py --from-step 3
```

- `--to-step N`: Stop at step `N` (1-based index).
```
python run_pipeline.py --to-step 7
```

- `--continue-on-error`: Continue running even if a step fails.
```
python run_pipeline.py --continue-on-error
```
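
The options above combine naturally. As a minimal sketch of how a runner could interpret them (not the actual `run_pipeline.py`; the `STEPS` list, `select_steps`, and `run` are hypothetical names, and `--to-step` is assumed to be inclusive):

```python
import argparse
import subprocess
import sys

# Hypothetical step table mirroring the list in "Steps in the Pipeline".
STEPS = [
    ["preprocessing.py", "--mode", "depth"],
    ["base_model_analysis.py"],
    ["paired_ttest.py"],
    ["plot_base_stats.py"],
    ["plot_base_transition_mat.py"],
    ["generate_base_model_summary.py"],
    ["draw_scatter_plots.py"],
    ["generate_scatter_plot_summary.py"],
    ["draw_histogram.py"],
    ["generate_histogram_summary.py"],
]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run the group-level evaluation pipeline.")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--from-step", type=int, default=1)
    parser.add_argument("--to-step", type=int, default=len(STEPS))
    parser.add_argument("--continue-on-error", action="store_true")
    return parser.parse_args(argv)


def select_steps(from_step, to_step):
    """Return (step_number, command) pairs for the 1-based, inclusive range."""
    if not 1 <= from_step <= to_step <= len(STEPS):
        raise ValueError(f"invalid step range: {from_step}..{to_step}")
    return [(n, STEPS[n - 1]) for n in range(from_step, to_step + 1)]


def run(args):
    """Run (or just print) the selected steps; return the numbers of failed steps."""
    failed = []
    for number, command in select_steps(args.from_step, args.to_step):
        full_command = [sys.executable, *command]
        print(f"[step {number}] {' '.join(full_command)}")
        if args.dry_run:
            continue  # print only, never execute
        if subprocess.run(full_command).returncode != 0:
            failed.append(number)
            if not args.continue_on_error:
                break  # stop at the first failure unless told otherwise
    return failed
```

Under these assumptions, `run(parse_args(["--from-step", "3", "--to-step", "7", "--dry-run"]))` prints steps 3 through 7 and executes nothing.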

## Step Details

### 1. `preprocessing.py`

This script loads raw experiment outputs, filters to depth topics, aligns tweet partners, harmonizes slider/Likert values, and writes two preprocessed CSVs.

Inputs:
- `result/eval/human_llm/<exp_dir>/<model_name>/opinion_memory_gpt-4o-mini-2024-07-18_v0.csv`
- `data/table_file_info_20250830.csv`

Outputs:
- `preprocessed_depth.csv` (all rows)
- `preprocessed_depth_valid.csv` (rows from valid experiments only, based on `table_file_info_20250830.csv`)

### 2. `base_model_analysis.py`

This script computes per-topic and aggregated bias/diversity statistics for humans vs. `gpt-4o-mini-2024-07-18` across the event timeline (Initial → Tweets 1–3 → Post). It produces per-topic CSVs and an “all topics” summary for both the stance (Likert-based) and slider views.

Inputs:
- Preprocessed data from step 1: `result/group_level_eval/preprocessed_depth_valid.csv`

Outputs:
- Per-topic folder: `result/group_level_eval/base_analysis/<topic>/`
    - `filtered_depth.csv` (topic-scoped filtered rows after de-dup)
    - `base_stance_stats.csv`
    - `base_slider_stats.csv`
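
As a rough illustration of the kind of aggregation this step performs, the sketch below groups the preprocessed rows by topic and event and computes a mean and a sample standard deviation per group. It assumes bias is summarized by the group mean and diversity by the group standard deviation (as the paired t-test step describes them); `group_stats` is a hypothetical helper, and the actual script may define its statistics and output columns differently.

```python
import pandas as pd


def group_stats(df, value_cols=("human_slider", "llm_slider")):
    """Mean and sample std of each value column per (topic, event_type)."""
    grouped = df.groupby(["topic", "event_type"])[list(value_cols)]
    stats = grouped.agg(["mean", "std"])
    # Flatten the (column, statistic) MultiIndex into names like "human_slider_mean".
    stats.columns = [f"{col}_{stat}" for col, stat in stats.columns]
    return stats.reset_index()
```

Passing `value_cols=("human_likert_pred", "llm_likert_pred")` would give the stance counterpart under the same assumptions.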

### 3. `paired_ttest.py`

This script runs paired t-tests on human vs. model stance/slider trajectories across depth topics. It tests both group-level means (bias) and standard deviations (diversity), plus diff-of-diff comparisons between human and LLM changes.

Inputs: 
- Preprocessed data: `result/group_level_eval/preprocessed_depth_valid.csv`
- Columns required:
    - `event_type`, `chat_order`, `time_stamp`, `topic` (or `Topic`)
    - `human_likert_pred`, `llm_likert_pred`
    - `human_slider`, `llm_slider`
    - `llm_text` (for deduplication)

Outputs: 

Saved to `result/group_level_eval/base_analysis/`:
- `paired_ttest_results.csv`. Columns include:
    - domain (stance / slider)
    - pair (tweet1_vs_tweet3, initial_vs_post)
    - who (human, llm, human_vs_llm)
    - stat_on (group_mean, group_std, diff_of_diff_mean, diff_of_diff_std)
    - n_groups, df, t_stat, p_value
    - Summary stats: group means & SDs for A vs B, plus diff_mean, sd_diff, se_diff.
- `paired_ttest_group_stats.csv` (for debugging): the per-topic, per-timestamp aggregated group stats used for the tests.
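
To make the test layout concrete, here is a minimal sketch of a paired test on per-group statistics and of a diff-of-diff comparison, using `scipy.stats.ttest_rel`. The function names are hypothetical, the returned keys only echo some of the columns listed above, and the sign convention (B minus A, LLM change minus human change) is an assumption rather than the script's documented behavior.

```python
import numpy as np
from scipy import stats


def paired_test(a, b):
    """Paired t-test of B against A, one value per group (e.g. per-topic means)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = b - a
    n = len(diff)
    sd_diff = diff.std(ddof=1)  # sample SD of the paired differences
    result = stats.ttest_rel(b, a)
    return {
        "n_groups": n,
        "df": n - 1,
        "t_stat": float(result.statistic),
        "p_value": float(result.pvalue),
        "diff_mean": float(diff.mean()),
        "sd_diff": float(sd_diff),
        "se_diff": float(sd_diff / np.sqrt(n)),
    }


def diff_of_diff_test(human_a, human_b, llm_a, llm_b):
    """Compare the LLM's change (B - A) with the human change, paired by group."""
    human_change = np.asarray(human_b, dtype=float) - np.asarray(human_a, dtype=float)
    llm_change = np.asarray(llm_b, dtype=float) - np.asarray(llm_a, dtype=float)
    return paired_test(human_change, llm_change)
```

For an `initial_vs_post` row with `stat_on = group_mean`, `a` and `b` would hold the per-group means at Initial and Post; for `group_std`, the per-group standard deviations.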


### 4. `plot_base_stats.py`

This script generates per-topic and aggregated plots for both stance and slider domains.

Inputs: 
- `result/group_level_eval/base_analysis/all_base_slider_stats.csv`
- `result/group_level_eval/base_analysis/all_base_stance_stats.csv`
- `result/group_level_eval/base_analysis/paired_ttest_results.csv`

Outputs:
- Per topic / per who:
    - `result/group_level_eval/base_analysis/<Topic>/<Human|LLM>/`
        - `bias_trajectory_<stance|slider>.svg` (+ `_publication.svg`)
        - `std_trajectory_<stance|slider>.svg` (+ `_publication.svg`)
        - `summary_bias_errorbar_<stance|slider>.svg` (+ `_publication.svg`)
        - `summary_std_errorbar_<stance|slider>.svg` (+ `_publication.svg`)
        - Split counterparts saved as `<name>_<suffix>_{opinion|tweet}.svg` (+ `_publication.svg`)
- Aggregated:
    - `result/group_level_eval/base_analysis/`
        - `summary_bias_errorbar_<stance|slider>.svg` (+ `_publication.svg` and split variants)
        - `summary_std_errorbar_<stance|slider>.svg` (+ `_publication.svg` and split variants)

### 5. `plot_base_transition_mat.py`

This script builds and visualizes stance transition matrices (counts) for Initial → Post and Tweet 1 → Tweet 3 at the topic level and aggregated across topics, for both humans and each model. It also produces slider-based counterparts and difference heatmaps (model − human).

Inputs: 
- From base analysis: `result/group_level_eval/base_analysis/<Topic>/filtered_depth.csv`

Outputs:

Under `result/group_level_eval/base_analysis/<Topic>/`:

- Data used

    - `transition_mat_data.csv` — concatenated subset of rows used for transitions (opinion + tweet)

- Human (`human/`):

    - `transition_matrix.svg` — Initial→Post (Likert)

    - `tweet_transition_matrix.svg` — Tweet1→Tweet3 (Likert)

    - `transition_matrix_slider.svg` — Initial→Post (Slider)

    - `tweet_transition_matrix_slider.svg` — Tweet1→Tweet3 (Slider)

- Each model (`<model_name>/`):

    - `transition_matrix.svg` — Initial→Post (Likert)

    - `tweet_transition_matrix.svg` — Tweet1→Tweet3 (Likert)

    - `transition_matrix_slider.svg` — Initial→Post (Slider)

    - `tweet_transition_matrix_slider.svg` — Tweet1→Tweet3 (Slider)

- Model–Human difference (for `gpt-4o-mini-2024-07-18` only)

    - `transition_matrix_diff.svg` — (model − human), opinion

    - `tweet_transition_matrix_diff.svg` — (model − human), tweet

- Inclusion tuple logs

    - `tuples_human_transition_mat.txt`

    - `tuples_llm_transition_mat.txt`
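
A count-based transition matrix of this kind can be sketched with `pandas.crosstab`. The helper below is illustrative only: `transition_counts` is a hypothetical name, and it assumes long-format rows keyed by `time_stamp` and `human_id`, with `event_type` holding the labels passed as `start_event` and `end_event` and `levels` listing the scale values to show on both axes.

```python
import pandas as pd


def transition_counts(df, value_col, start_event, end_event, levels):
    """Count participants moving from each start level to each end level."""
    keys = ["time_stamp", "human_id"]
    wide = (
        df[df["event_type"].isin([start_event, end_event])]
        .pivot_table(index=keys, columns="event_type", values=value_col, aggfunc="first")
        .reindex(columns=[start_event, end_event])
        .dropna()  # keep only participants observed at both events
    )
    counts = pd.crosstab(wide[start_event], wide[end_event])
    # Show every level on both axes, including levels that never occur.
    return counts.reindex(index=levels, columns=levels, fill_value=0)
```

With two such matrices built over the same `levels`, a (model − human) difference is just `model_counts - human_counts`.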


### 6. `generate_base_model_summary.py`

This script assembles a Word report from the figures produced in prior steps. It converts SVG plots to PNG (via cairosvg/libcairo) and lays them out into multiple .docx files, including publication-ready and split-view variants. It also embeds transition matrices (per-topic and aggregated) and an Appendix with topic/timestamp metadata.

Inputs: 

From `result/group_level_eval/base_analysis/`:

- Per-topic plots (created by step 4 `plot_base_stats.py`), e.g.:
    - `summary_bias_errorbar(_stance|_slider)[_publication].svg`,
    - `summary_std_errorbar(_stance|_slider)[_publication].svg`,
    - plus combined variants: `combined_summary_*`

- Per-topic transition matrix plots (created by step 5), e.g.:
    - `<Topic>/human/transition_matrix.svg`, `<Topic>/gpt-4o-mini-2024-07-18/tweet_transition_matrix.svg`, etc.

- Aggregated transition matrix SVGs (step 5) in the base analysis folder.

Outputs:

```
result/group_level_eval/base_analysis/
├─ base_model_summary.docx
├─ base_model_summary_publication.docx
├─ base_model_summary_split.docx
└─ base_model_summary_split_publication.docx
```

### 7. `draw_scatter_plots.py`

This script creates scatter plots that relate opinion/stance levels and changes (Δ) to peers/partners, for both humans and the LLM (`gpt-4o-mini-2024-07-18`). It also produces paired Δ (LLM vs. human) plots and publication variants.

Inputs:

- From step 2: `result/group_level_eval/base_analysis/filtered_depth.csv` (already de-duplicated per `(time_stamp, event_type, chat_order, human_id)`, keeping the longest `llm_text`)

Outputs:

```
result/group_level_eval/
├─ base_analysis/
│  └─ filtered_depth.csv                # input used
└─ plots/
   ├─ <Topic>/
   │  ├─ stance_*.svg (+ _publication.svg)
   │  └─ slider_*.svg (+ _publication.svg)
   ├─ stance_*.svg / slider_*.svg       # aggregated
   ├─ paired_delta_slider*.svg
   ├─ paired_delta_stance*.svg
   └─ paired_delta_tweet*.svg
```

### 8. `generate_scatter_plot_summary.py`

This script builds two Word summaries that tile the per-topic and aggregated scatter plots into 3×3 grids: a regular version and a publication version (labels/legends removed in the source SVGs). It converts SVG to PNG via cairosvg (requires libcairo).

Inputs:

- Source plots from step 7 in: `result/group_level_eval/plots/`
    - Per-topic folders: `plots/<Topic>/...`
    - Aggregated (root-level): `plots/*.svg`

The script intentionally excludes the raw topic `Everything_that_happens_can_eventually_be_explained_by_science` and relies on its `_reversed` twin for summaries.

Outputs: 
- `result/group_level_eval/plots/scatter_grid_summary.docx`
- `result/group_level_eval/plots/scatter_grid_summary_pub.docx`
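
The tiling logic itself is simple to sketch. The snippet below shows one way to drop the excluded raw topic and to split a list of plots into 3×3 pages; `EXCLUDED_TOPICS`, `drop_excluded`, and `chunk_into_grids` are hypothetical names, and the actual script's ordering and layout rules may differ.

```python
# Raw topic excluded from the summaries, as described above.
EXCLUDED_TOPICS = {"Everything_that_happens_can_eventually_be_explained_by_science"}


def drop_excluded(topic_names):
    """Keep every topic folder name except the excluded raw topic."""
    return [name for name in topic_names if name not in EXCLUDED_TOPICS]


def chunk_into_grids(items, rows=3, cols=3):
    """Split items into pages; each page is a list of rows of up to `cols` items."""
    per_page = rows * cols
    pages = []
    for start in range(0, len(items), per_page):
        page = items[start:start + per_page]
        pages.append([page[i:i + cols] for i in range(0, len(page), cols)])
    return pages
```

Each inner list then corresponds to one table row in the Word document, with the final page possibly left partially filled.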

### 9. `draw_histogram.py`

This script produces side-by-side grouped histograms (with mean lines) comparing distributions of stance/slider values and their deltas (Δ) for humans vs. the LLM and for within-source event pairs. It saves both regular and publication variants.

Inputs:
- `result/group_level_eval/preprocessed_depth_valid.csv` (from Step 1)
    - Filters to `model_name == "gpt-4o-mini-2024-07-18"`
    - Keeps only events in: `Initial(0), Tweet1(1), Tweet2(2), Tweet3(3), Post(4)`
    - Deduplicates per `(time_stamp, event_type, chat_order, human_id)` to longest `llm_text`
    - Uses the `Topic` or `topic` column; reverse-codes `Everything_that_happens_can_eventually_be_explained_by_science`
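
The de-duplication rule above (one row per key, keeping the longest `llm_text`) can be sketched in pandas as follows. `keep_longest_llm_text` is a hypothetical helper, not the script's own function, and how ties between equally long texts are broken is left unspecified here.

```python
import pandas as pd

DEDUP_KEYS = ["time_stamp", "event_type", "chat_order", "human_id"]


def keep_longest_llm_text(df):
    """Keep one row per key tuple: the row whose llm_text is longest."""
    lengths = df["llm_text"].fillna("").astype(str).str.len()
    return (
        df.assign(_text_len=lengths)
        .sort_values("_text_len", ascending=False, kind="stable")
        .drop_duplicates(subset=DEDUP_KEYS, keep="first")  # longest text comes first
        .drop(columns="_text_len")
        .sort_index()  # restore the original row order
    )
```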

Outputs: under `result/group_level_eval/histograms/`

```
barhist_human_tweet1_vs_tweet3.svg
barhist_human_tweet1_vs_tweet3_pub.svg
barhist_human_initial_vs_post_stance.svg
barhist_human_initial_vs_post_stance_pub.svg
barhist_llm_tweet1_vs_tweet3.svg
barhist_llm_tweet1_vs_tweet3_pub.svg
barhist_llm_initial_vs_post_stance.svg
barhist_llm_initial_vs_post_stance_pub.svg
barhist_diff_tweet3_minus_tweet1_stance.svg
barhist_diff_tweet3_minus_tweet1_stance_pub.svg
barhist_diff_post_minus_initial_stance.svg
barhist_diff_post_minus_initial_stance_pub.svg
barhist_human_initial_vs_post_slider.svg
barhist_human_initial_vs_post_slider_pub.svg
barhist_llm_initial_vs_post_slider.svg
barhist_llm_initial_vs_post_slider_pub.svg
barhist_diff_post_minus_initial_slider.svg
barhist_diff_post_minus_initial_slider_pub.svg
```

### 10. `generate_histogram_summary.py`

This script assembles two DOCX summaries that tile the grouped bar-histograms (from Step 9) into a 1×3 layout per group: Human | LLM | Δ for Slider, Stance, and Tweet. It handles both regular and publication variants.

Inputs:
- SVGs from `result/group_level_eval/histograms/` (produced by `draw_histogram.py`):

- Slider:
    - `barhist_human_initial_vs_post_slider.svg`
    - `barhist_llm_initial_vs_post_slider.svg`
    - `barhist_diff_post_minus_initial_slider.svg`
- Stance (Initial↔Post):
    - `barhist_human_initial_vs_post_stance.svg`
    - `barhist_llm_initial_vs_post_stance.svg`
    - `barhist_diff_post_minus_initial_stance.svg`
- Tweet (Tweet1↔Tweet3):
    - `barhist_human_tweet1_vs_tweet3.svg`
    - `barhist_llm_tweet1_vs_tweet3.svg`
    - `barhist_diff_tweet3_minus_tweet1_stance.svg`
- Publication doc uses the `*_pub.svg` counterparts.
