# Exploiting LLMs for Automatic Hypothesis Assessment via a Logit-Based Calibrated Prior

## Data

### Benchmarks

We release three benchmark datasets:

1. **Real-world Correlations**: Variable pairs with observed correlations, extracted from the [Cause-Effect](https://webdav.tuebingen.mpg.de/cause-effect/) and [Kaggle](https://www.kaggle.com/) datasets. For the Kaggle portion, we build on the dataset curated by [Trummer et al.](https://github.com/itrummer/DataCorrelationPredictionWithNLP).

   → [benchmark/real_world_correlations.csv](benchmark/real_world_correlations.csv)

2. **Counterfactual Correlations**: Cause-effect pairs with hypothetical contexts that reverse their original correlations.  
   → [benchmark/counterfactual_correlations.csv](benchmark/counterfactual_correlations.csv)

3. **Chicago Correlations**: A set of 115 correlations calculated on [Chicago Open Data](https://data.cityofchicago.org/), sampled from the dataset released by the [Nexus authors](https://github.com/TheDataStation/nexus_correlation_discovery). Of these, 15 are marked as hypothesis-worthy (`hypothesis=True` in the CSV file), based on the annotations reported in Table 2 of the original [Nexus evaluation](https://dl.acm.org/doi/10.1145/3654957).  
   → [benchmark/chicago_correlations.csv](benchmark/chicago_correlations.csv)
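
As a rough illustration of how the `hypothesis` flag might be used, the sketch below filters a CSV for rows marked `hypothesis=True`. Only the `hypothesis` column name comes from the description above; the other column names (`var1`, `var2`, `r`) and the helper `load_hypothesis_rows` are hypothetical, and the in-memory sample merely stands in for the real file.

```python
import csv
import io


def load_hypothesis_rows(csv_file):
    """Return the rows whose `hypothesis` column is marked True."""
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if (row.get("hypothesis") or "").strip().lower() == "true"
    ]


# Tiny in-memory stand-in for benchmark/chicago_correlations.csv.
# Only `hypothesis` is a documented column; `var1`, `var2`, and `r`
# are hypothetical names used for illustration.
sample = io.StringIO(
    "var1,var2,r,hypothesis\n"
    "a,b,0.42,True\n"
    "c,d,-0.10,False\n"
    "e,f,0.77,True\n"
)
selected = load_hypothesis_rows(sample)
print(len(selected))
```

To apply this to the released file, open it with `open("benchmark/chicago_correlations.csv", newline="")` and pass the file object to the helper; if the flags match the description above, 15 rows should be returned.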

### Experiment Data

We also release the raw experimental outputs corresponding to the three benchmarks above, available in the `outputs` directory. This data can be used to reproduce all the results reported in the paper.

We also include the raw experiment data for the RoBERTa classifier in the `outputs/roberta_classifier` directory.

## Installation

```bash
$ cd correlation_prior
$ conda create -n corr_prior python=3.11 -y
$ conda activate corr_prior
$ pip install -r requirements.txt
$ export PYTHONPATH="$(pwd):$PYTHONPATH" # make the correlation_prior importable
```

### Setting your OpenAI API Key

```bash
export OPENAI_API_KEY="your_api_key_here"
```

## Build Correlation Priors

The script `eval_llm_prior_parallel.py` generates different types of correlation priors using LLMs. The LLM calls are parallelized across workers. Its full usage is listed below:

```bash
usage: eval_llm_prior_parallel.py [-h] [--input_file INPUT_FILE] [--output_dir OUTPUT_DIR] [--model MODEL] [--prior PRIOR] [--num_iter NUM_ITER]
                                  [--ref_file REF_FILE] [--workers WORKERS]

Elicit various types of correlation priors from LLMs.

options:
  -h, --help            show this help message and exit
  --input_file INPUT_FILE
                        Path to the input correlations file.
  --output_dir OUTPUT_DIR
                        Path to the output directory.
  --model MODEL         The LLM model to use.
  --prior PRIOR         Prior type to use. Options: "gaussian_prior", "kde_prior", "lc_prior"
  --num_iter NUM_ITER   Number of iterations to run.
  --ref_file REF_FILE   Path to a previous output file for lookup; any correlation already in this file will be reused instead of being reprocessed.
  --workers WORKERS     The number of workers to use for parallel processing.
```
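
The options above can be combined into a single invocation. The following Python sketch assembles such a command line; the flag names and prior options are taken from the usage text, while the helper `build_command`, the script location, and the default `workers=4` are assumptions made for illustration.

```python
import sys

# Prior types listed in the usage text above.
VALID_PRIORS = ("gaussian_prior", "kde_prior", "lc_prior")


def build_command(input_file, output_dir, model, prior, num_iter=1,
                  workers=4, ref_file=None,
                  script="eval_llm_prior_parallel.py"):
    """Build an argv list for the prior-generation script.

    `script` is assumed to be reachable from the current directory;
    adjust the path to match your checkout.
    """
    if prior not in VALID_PRIORS:
        raise ValueError(f"unknown prior {prior!r}; expected one of {VALID_PRIORS}")
    cmd = [
        sys.executable, script,
        "--input_file", input_file,
        "--output_dir", output_dir,
        "--model", model,
        "--prior", prior,
        "--num_iter", str(num_iter),
        "--workers", str(workers),
    ]
    if ref_file is not None:
        # Reuse correlations already present in a previous output file.
        cmd += ["--ref_file", ref_file]
    return cmd


cmd = build_command(
    input_file="benchmark/real_world_correlations.csv",
    output_dir="outputs/real_world_correlations/",
    model="gpt-4o",
    prior="lc_prior",
)
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the script; note that this issues real LLM API calls.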

The generated priors are saved as CSV files in the specified output directory. For LCP, the stored output is a discrete probability distribution; when used online, it is converted into a continuous distribution.
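
One simple way to turn a stored discrete distribution into a continuous one is a piecewise-constant density. The sketch below is only an illustration under stated assumptions: it assumes equal-width bins covering the correlation range [-1, 1], and the function `discrete_to_density` is hypothetical. The conversion actually used in this repository may differ.

```python
def discrete_to_density(probs, low=-1.0, high=1.0):
    """Convert bin probabilities into a piecewise-constant density.

    Assumes `probs` holds one probability per equal-width bin
    spanning [low, high]; this layout is an assumption.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    width = (high - low) / len(probs)
    # Normalize, then divide by the bin width so the density integrates to 1.
    heights = [p / total / width for p in probs]

    def pdf(r):
        if r < low or r > high:
            return 0.0
        # Clamp so that r == high falls into the last bin.
        idx = min(int((r - low) / width), len(heights) - 1)
        return heights[idx]

    return pdf


pdf = discrete_to_density([0.1, 0.2, 0.3, 0.4])
print(pdf(0.75))
```

Dividing each bin probability by the bin width is what makes the result a density: the area under the curve, rather than the sum of its values, equals 1.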

To reproduce all prior-generation methods used in the paper, simply run:

```bash
$ sh run_exp.sh
```

## Evaluate the Quality of a Correlation Prior

The paper defines several metrics for evaluating the quality of a prior, including sign accuracy, absolute error, and information content. Run the `eval/process_results.py` script to compute each metric for every prior on a benchmark.

Example Usage:

```bash
python eval/process_results.py \
  --benchmark_name "real_world_correlations" \
  --output_dir "outputs/real_world_correlations/" \
  --num_iter 1 \
  --model_type "gpt-4o" \
  --priors "Uniform,Gaussian,KDE,LCP"
```
 
Example Output:

```bash
+----------+---------------+-----------+-----------+---------------------+--------------+
|  Method  | Sign Accuracy |   Error   |    p(r)   | Information Content | 95% coverage |
+----------+---------------+-----------+-----------+---------------------+--------------+
| Uniform  |     0.511     | 0.51±0.29 | 0.50±0.00 |      0.69±0.00      |    92.3%     |
| Gaussian |     0.731     | 0.26±0.28 | 1.73±2.80 |      4.10±7.56      |    49.1%     |
|   KDE    |     0.788     | 0.26±0.27 | 1.61±1.89 |      1.73±4.48      |    59.9%     |
|   LCP    |     0.788     | 0.26±0.27 | 0.92±0.38 |      0.27±0.82      |    89.2%     |
+----------+---------------+-----------+-----------+---------------------+--------------+
```
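
To make the table columns more concrete, the sketch below computes simplified versions of three of the metrics. The definitions here are assumptions rather than the ones implemented in `process_results.py`: sign accuracy is taken as the fraction of predictions whose sign matches the observed correlation, error as the mean absolute difference, and information content as the negative natural log of the prior density at the observed value. The last assumption is at least consistent with the Uniform row, where a density of 0.50 on [-1, 1] gives about 0.69.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_accuracy(predicted, observed):
    """Fraction of predictions whose sign matches the observation."""
    hits = sum(1 for p, o in zip(predicted, observed) if sign(p) == sign(o))
    return hits / len(observed)


def mean_absolute_error(predicted, observed):
    """Mean of |predicted - observed| over all pairs."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def information_content(density):
    """Assumed definition: negative natural log of the density p(r)."""
    return -math.log(density)


# Hypothetical point predictions and observed correlations.
predicted = [0.3, -0.2, 0.5]
observed = [0.4, 0.1, 0.6]

print(round(sign_accuracy(predicted, observed), 3))
print(round(mean_absolute_error(predicted, observed), 3))

# A uniform prior on [-1, 1] has constant density 1 / 2.
uniform_density = 1.0 / (1.0 - (-1.0))
print(round(information_content(uniform_density), 2))
```

Consult the paper for the exact definitions, including how the reported ± spreads and the 95% coverage column are computed.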

