<h1 align="center">
<br>
CauSciBench: A Comprehensive Benchmark Assessing LLM Causal Reasoning for Scientific Research
</h1>

## Overview
Large language models (LLMs) show increasing promise in facilitating scientific research, yet their ability to perform causal inference, a cornerstone of rigorous statistical methodology for scientific discovery, remains largely under-explored.

**CauSciBench** is the first comprehensive benchmark designed to evaluate end-to-end causal inference for scientific research, from natural language question to effect estimate. It aims to quantify this ability and spans both the potential-outcomes and structural causal model (SCM) frameworks.

## Benchmark Data
We provide a thoroughly filtered list of datasets and their corresponding natural language queries, drawn from three distinct sources:
- Real-world Studies (185 queries from 100 papers)
- Synthetic Scenarios (143 synthesized queries)
- Textbook Examples (39 queries that focus on causal inference from [QRData](https://github.com/xxxiaol/QRData))

Expert-curated information about the causal queries is stored in `data/real_info.csv`, `data/synthetic_info.csv`, and `data/qr_info.csv`, depending on the source.
Each entry in these files contains the following core information:
- `paper_name`
- `data_description`
- `natural_language_query`
- `answer`
- `std_error`
- `is_significant`
- `method`
- `treatment`
- `outcome`
- `covariates`
- `running_var`
- `temporal_var`
- `instrument_var`
- `state_var`
- `interacting_variable`
- `multirct_treatment`
- `data_files`
- `mediator` (Synthetic Scenarios only)
- `domain` (Real-world Studies only)
- `state_fixed`, `time_fixed` (Textbook Examples only)

Data files used in the above queries are stored in `data/all_data`.
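
The metadata files described above can be inspected with the standard library alone. The following is a minimal sketch, not part of the repository: `load_entries` and `CORE_COLUMNS` are hypothetical names, and the in-memory sample contains made-up values rather than benchmark data.

```python
import csv
import io

# Core columns listed in this README; source-specific columns are omitted.
CORE_COLUMNS = [
    "paper_name", "data_description", "natural_language_query", "answer",
    "std_error", "is_significant", "method", "treatment", "outcome",
    "covariates", "running_var", "temporal_var", "instrument_var",
    "state_var", "interacting_variable", "multirct_treatment", "data_files",
]


def load_entries(handle):
    """Read metadata rows from an open CSV file and report missing core columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in CORE_COLUMNS if c not in (reader.fieldnames or [])]
    return list(reader), missing


# Tiny in-memory stand-in for a metadata file (made-up values, not benchmark data).
sample = io.StringIO(
    "paper_name,natural_language_query,method,treatment,outcome\n"
    "Example study,Does X affect Y?,example_method,X,Y\n"
)
entries, missing = load_entries(sample)
print(entries[0]["treatment"], "->", entries[0]["outcome"])
print("missing core columns:", missing)
```

To run the same check on a real file, pass an open handle instead of the sample, for example `open("data/real_info.csv", newline="")`.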

## Replication

### Prerequisites
- Python 3.10+ (local host)
- Docker (required for safe code execution)
- Provider credentials (optional, depending on `--api`)

### 1. Build Docker Image

```bash
docker build -t python-baseline-http -f baselines/Dockerfile.http baselines
```

### 2. Dataset Organization

The input CSV files and metadata are organized as follows:

```
data/
├── all_data/           # QR benchmark data + real-world datasets
├── synthetic_data/     # Synthetic datasets
└── json/               # Query files
    ├── qrdata.json
    ├── real_data.json
    └── synthetic_data.json
```

The JSON files containing queries are located in `data/json/`. If you need to create your own JSON query files from CSV metadata, use:

```bash
bash scripts/create_json.sh
```

### 3. Run Baseline Experiments

You can run experiments using either the provided script or directly with Python:

**Using the script:**
```bash
bash scripts/run_baseline.sh
```

**Using Python directly:**
```bash
python baselines/run_baselines.py \
  --queries data/json/qrdata.json \
  --output output/qrdata/qrdata_react_gpt-4o.json \
  --api openai \
  --model gpt-4o \
  --persistent \
  --react \
  --data-type qrdata
```

#### How baselines/run_baselines.py Works

The `run_baselines.py` script orchestrates the entire causal analysis workflow:

1. **Load Queries**: Reads JSON files containing causal questions and meta information
2. **Initialize LLM**: Connects to your chosen API provider (OpenAI, Azure, Vertex, Together)
3. **Docker Setup**: Starts Python containers for safe code execution
4. **Analysis Loop**: For each query:
   - Sends the causal question to the LLM with dataset context
   - LLM generates Python code for causal estimation
   - Executes code in Docker container with libraries like DoWhy, EconML
   - Iterates based on results and errors
   - Extracts structured final results
5. **Save Results**: Outputs comprehensive JSON with chat history, code, and analysis
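
The analysis loop in step 4 can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the code in `run_baselines.py`: `run_query`, `fake_llm`, and `fake_sandbox` are hypothetical names, and the LLM call and Docker execution are replaced by toy stand-ins so the example runs without credentials or containers.

```python
from typing import Callable, Optional


def run_query(
    question: str,
    generate_code: Callable[[str, Optional[str]], str],
    execute: Callable[[str], tuple],
    max_rounds: int = 3,
) -> dict:
    """Ask for code, run it, and retry with the error message until it succeeds."""
    history = []
    feedback = None
    for _ in range(max_rounds):
        code = generate_code(question, feedback)   # stand-in for the LLM call
        ok, output = execute(code)                 # stand-in for the Docker sandbox
        history.append({"code": code, "ok": ok, "output": output})
        if ok:
            return {"result": output, "history": history}
        feedback = output  # pass the error back on the next round
    return {"result": None, "history": history}


# Toy stand-ins so the sketch runs without an API key or Docker.
def fake_llm(question, feedback):
    # First attempt fails on purpose; the retry returns working code.
    return "1 / 0" if feedback is None else "2 + 2"


def fake_sandbox(code):
    # eval is used only on the fixed toy strings above, never on real model output.
    try:
        return True, str(eval(code))
    except Exception as exc:
        return False, repr(exc)


outcome = run_query("What is the effect of X on Y?", fake_llm, fake_sandbox)
print(outcome["result"], len(outcome["history"]))
```

The first round fails and its error is fed back, so the history holds two entries: the failed attempt and the successful retry.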

**Key Parameters:**
- `--queries`: Path to JSON file with causal questions
- `--output`: File path where results are saved
- `--api`: LLM provider (e.g., openai, azure, vertex, together)
- `--model`: LLM model (e.g., gpt-4o, claude-3-sonnet)
- `--persistent`: Use stateful Python environment
- `--potm/--react/--chain`: Prompting strategy flags; if none is given, direct prompting is used
- `--data-type`: Dataset category (real, synthetic, qrdata)

#### Output Structure

Each experiment produces a JSON file:
```
output/{data_source}/{data_source}_{prompt}_{model}.json
```

Example: `output/qrdata/qrdata_react_gpt-4o.json`
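
When scripting several runs, the naming pattern above can be generated programmatically. The helper below is a hypothetical convenience, not a function provided by the repository.

```python
from pathlib import PurePosixPath


def output_path(data_source: str, prompt: str, model: str, root: str = "output") -> str:
    """Build output/{data_source}/{data_source}_{prompt}_{model}.json."""
    name = f"{data_source}_{prompt}_{model}.json"
    return str(PurePosixPath(root) / data_source / name)


print(output_path("qrdata", "react", "gpt-4o"))
# prints: output/qrdata/qrdata_react_gpt-4o.json
```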

### 4. Compile Results

After running experiments, compile the JSON outputs into CSV format:

```bash
python scripts/compile_results.py -if output/{data_source} -of errors/{data_source} -sd {data_source}
```

For example, for `qrdata`, run:
```bash
python scripts/compile_results.py -if output/qrdata -of errors/qrdata -sd qrdata
```

The `compile_results.py` script processes all JSON files in the input folder and creates CSV files for each prompting strategy (basic, cot, pot, react). It standardizes method names, extracts causal effects, and combines results across different models into a structured format for analysis.

#### CSV Results
Compiled results are saved as:
```
errors/{data_source}/{data_source}_{prompt}.csv
```

Each CSV contains the original queries, ground truth methods/effects, and predictions from each model tested.

## Script Configuration

### scripts/run_baseline.sh
Modify these variables to change experimental settings:
- `QUERIES_JSON_PATH`: Which dataset to analyze
- `API`: LLM provider
- `MODEL`: Specific model to use
- `OUTPUT_FOLDER`: Folder where results are saved 

### scripts/create_json.sh
Configure column mappings for CSV-to-JSON conversion:
- `FILENAME_COL`: Dataset filename column
- `DESC_NAME_COL`: Dataset description column
- `QUERY_NAME_COL`: Causal query column
- `PAPER_NAME_COL`: Study identifier column
- `OUTPUT_FOLDER`: Folder where output is saved
- `OUTPUT_NAME`: Name of the output JSON file
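
To illustrate what the column mappings are for, the sketch below converts CSV rows into JSON records. It is an assumption-laden illustration, not the logic of `scripts/create_json.sh`: the column values assigned to each variable, the function `rows_to_records`, and the output keys are all hypothetical, and the actual JSON schema in `data/json/` may differ.

```python
import csv
import io
import json

# Hypothetical values for the variables configured in scripts/create_json.sh.
FILENAME_COL = "data_files"
DESC_NAME_COL = "data_description"
QUERY_NAME_COL = "natural_language_query"
PAPER_NAME_COL = "paper_name"


def rows_to_records(handle):
    """Map each CSV row to a dict; the output keys here are illustrative only."""
    return [
        {
            "paper": row[PAPER_NAME_COL],
            "dataset_file": row[FILENAME_COL],
            "dataset_description": row[DESC_NAME_COL],
            "query": row[QUERY_NAME_COL],
        }
        for row in csv.DictReader(handle)
    ]


# Made-up sample row standing in for a metadata CSV.
sample = io.StringIO(
    "paper_name,data_description,natural_language_query,data_files\n"
    "Example study,A toy dataset,Does X affect Y?,example.csv\n"
)
records = rows_to_records(sample)
print(json.dumps(records, indent=2))
```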

