# Real-data schema (inferred)

This document describes the expected columns for the released real-data CSVs.

## 1) `metadataset_Risk.csv` (run-level / probe-level)

Each row corresponds to one run (seed) evaluated at a probing depth `probe_c` for a given dataset/task `dataset_name`.

**Index / identifiers**
- `dataset_name`: task/dataset identifier (string)
- `seed`: random seed / run id (int)
- `probe_c`: probing depth (optimization steps; int)

**Outcomes**
- `R_true`: final outcome/performance for that run (float)
- `R_pred`: predicted outcome from the predictor (float; optional for some analyses)
- `squared_error`: per-run squared error `(R_true - R_pred)^2` used as the empirical risk proxy (float)
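
The relationship between the three outcome columns can be checked mechanically. The snippet below is a minimal sketch, not part of the release: the helper name `check_risk_rows`, the tolerance, and the inline sample rows are all hypothetical, and it assumes that a missing or empty `R_pred` means the check should be skipped for that row.

```python
import csv
import io
import math

# Columns this sketch needs; R_pred is only checked when it is present.
REQUIRED_COLUMNS = ["dataset_name", "seed", "probe_c", "R_true", "squared_error"]


def check_risk_rows(rows, tol=1e-9):
    """Return a list of (row_index, message) for rows that look inconsistent."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if row.get(c) in (None, "")]
        if missing:
            problems.append((i, f"missing values for: {missing}"))
            continue
        if not row.get("R_pred"):
            # Assumption: rows without a prediction are skipped, not flagged.
            continue
        expected = (float(row["R_true"]) - float(row["R_pred"])) ** 2
        actual = float(row["squared_error"])
        if not math.isclose(actual, expected, rel_tol=tol, abs_tol=tol):
            problems.append((i, f"squared_error={actual}, expected {expected}"))
    return problems


# Made-up rows for illustration only; the second row is deliberately wrong.
SAMPLE = """dataset_name,seed,probe_c,R_true,R_pred,squared_error
toy_task,0,10,0.5,0.25,0.0625
toy_task,1,10,0.5,1.0,0.3
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = check_risk_rows(rows)
```

To run the same check on the released file, replace the `io.StringIO(SAMPLE)` source with an open handle to `metadataset_Risk.csv`.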

**Static / dynamic features**
All remaining columns (e.g., `dataset_num_items`, `base_model_perplexity`, `loss_decay_rate`, `gradient_consistency`)
are treated as pre-computed features. They are not required for reproducing the figures unless explicitly stated.

## 2) `risk_curve_by_dataset.csv` (aggregated curves)

Each row corresponds to one dataset/task and one probing depth.

- `dataset_name`: task/dataset identifier
- `probe_c`: probing depth
- `L_hat`: empirical risk estimate at depth `probe_c`
- `R_mean`: mean outcome across runs (optional)
- `R_var`: variance of outcomes across runs (optional)
- `n`: number of runs aggregated for that `(dataset_name, probe_c)` pair (optional)
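
One plausible way to relate the two files is to group the run-level rows by `(dataset_name, probe_c)`. The sketch below rests on assumptions this document does not confirm: that `L_hat` is the mean of `squared_error` over runs, that `R_mean` and `R_var` are computed from `R_true`, and that the variance is the population variance. The function name `aggregate_runs` and the sample rows are hypothetical.

```python
import csv
import io
import statistics
from collections import defaultdict


def aggregate_runs(rows):
    """Group run-level rows by (dataset_name, probe_c) and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["dataset_name"], int(row["probe_c"]))
        groups[key].append((float(row["R_true"]), float(row["squared_error"])))

    curve = []
    for (name, depth), values in sorted(groups.items()):
        outcomes = [v[0] for v in values]
        errors = [v[1] for v in values]
        curve.append({
            "dataset_name": name,
            "probe_c": depth,
            # Assumption: L_hat is the mean per-run squared error.
            "L_hat": statistics.fmean(errors),
            "R_mean": statistics.fmean(outcomes),
            # Assumption: population variance; the released file may use the sample variance.
            "R_var": statistics.pvariance(outcomes),
            "n": len(values),
        })
    return curve


# Made-up run-level rows for illustration only.
SAMPLE = """dataset_name,seed,probe_c,R_true,squared_error
toy_task,0,10,0.5,0.25
toy_task,1,10,1.0,0.75
toy_task,0,20,2.0,0.0
"""

curve = aggregate_runs(csv.DictReader(io.StringIO(SAMPLE)))
```

Comparing the output of such an aggregation against `risk_curve_by_dataset.csv` is a quick way to find out which of these assumptions actually holds for the released data.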

