# Simulator-based RAG for Grounding LLMs in Long-form Scientific Question Answering

Code for the paper "Simulator-based RAG for Grounding LLMs in Long-form Scientific Question Answering".

## Abstract
Long-form question answering in scientific domains is crucial for comprehensive understanding of complex phenomena that require detailed explanations spanning multiple interconnected concepts and evidence. While large language models (LLMs) show promise for scientific question answering, they often suffer from hallucination. Retrieval-Augmented Generation (RAG) approaches can ground LLMs by incorporating external knowledge sources to improve trustworthiness. In this context, scientific simulators, which play a vital role in validating hypotheses, offer a particularly promising retrieval source to mitigate hallucination and enhance answer factuality. However, existing RAG approaches cannot be directly applied for scientific simulation-based retrieval due to two fundamental challenges: how to retrieve from scientific simulators, and how to efficiently verify and update long-form answers. To overcome these challenges, we propose the simulator-based RAG framework (SimulRAG) and provide a long-form scientific QA benchmark covering climate science and epidemiology with ground truth verified by both simulations and human annotators. In this framework, we propose a generalized simulator retrieval interface to transform between textual and numerical modalities. We further design a claim-level generation method that utilizes uncertainty estimation scores and simulator boundary assessment (UE+SBA) to efficiently verify and update claims. Extensive experiments demonstrate that SimulRAG outperforms traditional RAG baselines in factuality and informativeness, while UE+SBA improves efficiency and quality for claim-level generation.

## Installation

To set up the environment, you'll need Python 3.9 and the required dependencies:

```bash
conda create -n uesba python=3.9
conda activate uesba
pip install -r requirements.txt
```

Then install the tools package in editable mode:
```bash
cd src/Epidemiology/gleam-ai-shared
pip install -e .
```

## Configuration

Before running the code, configure your API keys in the `.env` file:

```
OPENAI_API_KEY="ADD_YOUR_API_KEY"
Google_API_KEY="ADD_YOUR_API_KEY"
```
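
Before launching long runs, it can help to confirm that both keys were actually filled in. The following is a minimal sketch using only the standard library; the helper names `parse_env`, `missing_keys`, and `check_env_file` are hypothetical and not part of this repository, and the parser only handles simple `KEY="value"` lines like those shown above.

```python
from pathlib import Path

# Key names copied from the `.env` example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "Google_API_KEY")
PLACEHOLDER = "ADD_YOUR_API_KEY"


def parse_env(text):
    """Parse simple KEY="value" lines into a dict, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return required keys that are absent, empty, or still the placeholder."""
    return [k for k in REQUIRED_KEYS if values.get(k, "") in ("", PLACEHOLDER)]


def check_env_file(path=".env"):
    """Return the list of keys that still need to be set in the given file."""
    env_path = Path(path)
    values = parse_env(env_path.read_text()) if env_path.exists() else {}
    return missing_keys(values)
```

Calling `check_env_file()` from the directory that contains `.env` returns an empty list when both keys are set.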

## Project Structure

The project consists of two main domains:
- `src/Climate`: Climate science experiments
- `src/Epidemiology2`: Epidemiology experiments

All commands below should be executed within the respective domain directories:
```bash
cd src/Climate  # or cd src/Epidemiology2
```

## Usage

### Main Workflow: UE + SBA + RAG

The main workflow consists of four key components:

1. **Uncertainty Estimation (UE)**: `uncertainty_analysis.py` - Calculate uncertainty scores for claims using multiple methods
2. **Simulator Boundary Assessment (SBA)**: `tools_analysis.py` - Determine the boundaries of the scientific tools and assess when they are needed
3. **RAG Simulation**: `rag_simulation.py` - Update claims based on uncertainty and boundaries through retrieval-augmented generation
4. **Final Answer Generation**: `rag_final_answer.py` - Generate final answers using the selected and updated claims
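
This README does not spell out how the UE and SBA outputs are combined. As a purely illustrative sketch, assume a claim is sent to the simulator only when its uncertainty score exceeds a threshold and the SBA step judges it to be within what the simulator can check. The `Claim` fields, the function `select_claims_for_simulation`, and the threshold value are all hypothetical and are not taken from the repository scripts.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    uncertainty: float  # hypothetical UE score; higher means less certain
    in_boundary: bool   # hypothetical SBA verdict: can the simulator check this claim?


def select_claims_for_simulation(claims, threshold=0.5):
    """Split claims into those to verify with the simulator and those kept as-is.

    Assumed rule (not confirmed by the source): verify a claim only if it is
    both uncertain (score above `threshold`) and inside the simulator boundary.
    """
    to_verify, kept = [], []
    for claim in claims:
        if claim.uncertainty > threshold and claim.in_boundary:
            to_verify.append(claim)
        else:
            kept.append(claim)
    return to_verify, kept
```

Under this assumed rule, confident claims and claims the simulator cannot address skip the simulator call, which is one way such a filter could reduce the number of simulations.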

Run the complete workflow:
```bash
python uncertainty_analysis.py && python tools_analysis.py && python rag_simulation.py && python rag_final_answer.py
```
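
The same chain can be driven from Python, which makes it easier to see which step failed. This is a minimal sketch and not a script shipped with the repository: `run_pipeline` and `WORKFLOW` are hypothetical names, and it assumes the four scripts are in the current directory and take no arguments, as in the command above.

```python
import subprocess
import sys

# Script names and order taken from the workflow command above.
WORKFLOW = [
    "uncertainty_analysis.py",
    "tools_analysis.py",
    "rag_simulation.py",
    "rag_final_answer.py",
]


def run_pipeline(scripts, run=subprocess.run):
    """Run each script in order and stop at the first non-zero exit code.

    Mirrors the behavior of chaining commands with `&&`.
    Returns the list of scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        result = run([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}", file=sys.stderr)
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline(WORKFLOW)` from within a domain directory runs the four steps with the current Python interpreter. The `run` parameter exists only so the function can be exercised without launching real processes.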

### Evaluation

**Claim Correctness Evaluation:**
```bash
python rag_claim_correctness.py
```
- `rag_claim_correctness.py` - Evaluate the correctness of claims after RAG processing

**Generate Analysis Metrics:**
```bash
python evaluation.py
```
- `evaluation.py` - Generate performance metrics and store the results table as `evaluation.csv`
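
To inspect the resulting table without opening a spreadsheet, something like the following sketch may be enough. It assumes only that `evaluation.csv` has a header row. The column names are not documented here, so the code treats them generically. `load_metrics` and `format_table` are hypothetical helpers, not part of the repository.

```python
import csv


def load_metrics(path="evaluation.csv"):
    """Read the metrics table into a list of dicts, one per row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def format_table(rows):
    """Render rows as aligned plain-text columns for quick inspection."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    # Each column is as wide as its longest cell or its header.
    widths = [max(len(h), *(len(str(r[h])) for r in rows)) for h in headers]
    lines = ["  ".join(h.ljust(w) for h, w in zip(headers, widths))]
    for r in rows:
        lines.append("  ".join(str(r[h]).ljust(w) for h, w in zip(headers, widths)))
    return "\n".join(lines)
```

For example, `print(format_table(load_metrics()))` prints the table after `evaluation.py` has been run in the same directory.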

### Data Generation

**Generate Question Templates and Open-ended Questions:**
```bash
python search_topics.py && python create_open.py && python upsample_open.py && python pre_answer_question.py && python finalize_question.py
```

File descriptions:
- `search_topics.py` - Generate initial question templates by searching relevant scientific topics
- `create_open.py` - Create open-ended questions from the templates
- `upsample_open.py` - Upsample and expand the open-ended question dataset
- `pre_answer_question.py` - Generate preliminary answers and prepare questions for RAG processing
- `finalize_question.py` - Finalize questions with corresponding RAG content and ground truth

## Model Checkpoints

Simulator checkpoints will be available soon.