# Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs

This repository contains code and data for "Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs".

## Factuality Tasks

### 1. Data

Biography generation:

1. For our list of biography entities, refer to the `scripts/factscore_eval/bio_entities.txt` file.
2. Use the prompt `"Tell me a paragraph bio of " + entity + ". "`, where `entity` is one line of the `bio_entities.txt` file.
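
The prompt construction above can be sketched as follows. This is a minimal illustration, not the repository's generation code: the helper name `bio_prompts` is hypothetical, and skipping blank lines is an assumption about how the entity file should be read.

```python
def bio_prompts(lines):
    """Build one biography prompt per entity line.

    `lines` is any iterable of strings, for example an open file handle
    on bio_entities.txt. Blank lines are skipped (an assumption).
    """
    entities = [line.strip() for line in lines if line.strip()]
    return ["Tell me a paragraph bio of " + entity + ". " for entity in entities]


if __name__ == "__main__":
    # Path taken from the README; run from the repository root.
    with open("scripts/factscore_eval/bio_entities.txt", encoding="utf-8") as handle:
        for prompt in bio_prompts(handle):
            print(prompt)
```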

PopQA generation:

1. For our list of PopQA entities, refer to the `scripts/factscore_eval/popqa_entities.json` file.
2. Use the prompt `"Provide me with a paragraph detailing some facts related to " + wiki_title + ". "`, where `wiki_title` is taken from each dictionary in the list stored in the `popqa_entities.json` file.
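
A minimal sketch of the PopQA prompt construction is shown below. It assumes that each dictionary in the JSON list stores the title under a key named `wiki_title`; check the file for the actual key name. The helper name `popqa_prompts` is hypothetical.

```python
import json


def popqa_prompts(records):
    """Build one PopQA prompt per record.

    `records` is a list of dictionaries; the "wiki_title" key is an
    assumption about the layout of popqa_entities.json.
    """
    prefix = "Provide me with a paragraph detailing some facts related to "
    return [prefix + record["wiki_title"] + ". " for record in records]


if __name__ == "__main__":
    # Path taken from the README; run from the repository root.
    with open("scripts/factscore_eval/popqa_entities.json", encoding="utf-8") as handle:
        for prompt in popqa_prompts(json.load(handle)):
            print(prompt)
```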

All generated responses that we synthesize from are included in the `data` directory.

We use FActScore for evaluation. Download the Wikipedia database and relevant files here:
https://github.com/shmsw25/FActScore/blob/main/factscore/download_data.py

### 2. Generate Samples / Generate Baselines

**Sample bio generation script for the 8-bit Qwen72B-Instruct model:** `scripts/sample_bio_gen.py`

**Sample LM consensus generation script:** `scripts/baseline_generation/cons.py`

Execution example:

```
  cd scripts/baseline_generation
  python cons.py --input <INPUT_FILE>  --output <OUTPUT_FILE>
```

### 3. ConGrs Construction

**Construction script:** `scripts/test_poa_bios_batch.ipynb`

ConGrs construction saves the ConGrs as `.pkl` files. Set the path to your generated samples in the notebook.
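
To inspect the saved files outside the notebook, something like the sketch below may help. It assumes the `.pkl` files sit together in one directory and were written with Python's standard `pickle` module; the helper name `load_pickles` is hypothetical. Unpickling custom graph objects requires the classes that define them to be importable, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_pickles(directory):
    """Map each .pkl file stem in `directory` to its unpickled object."""
    loaded = {}
    for path in sorted(Path(directory).glob("*.pkl")):
        with path.open("rb") as handle:
            loaded[path.stem] = pickle.load(handle)
    return loaded
```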

### 4. Decoding

**Decoding script:** `scripts/test_poa_bios_batch_decode.ipynb`

This notebook performs response synthesis with consensus decoding.

- Provide the path to your saved ConGr `.pkl` files in the script.
- Specify the task code in the script:
  - `popqa` → **PopQA**
  - `bio` → **Biographies**

### 5. Eval

**Note:** FActScore eval is to be executed first, and then HALoGEN eval. The output of the FActScore eval is the input of the HALoGEN eval.

**Sample FActScore eval input file:** `scripts/factscore_eval/sample_fs_input.json`

**Sample FActScore eval output file:** `scripts/factscore_eval/sample_fs_output.json`

Execution example:

```
  cd scripts/factscore_eval
  python factscore_eval_run.py --data_path <INPUT_FILE>  --result_path <OUTPUT_FILE>
```

**Sample HALoGEN eval input file:** `scripts/halogen_eval/sample_factuality_input.json`

Add your OpenAI API keys to the `config.yml` file.

Execution example:

```
    cd scripts/halogen_eval
    python bio_scorer.py --input_dir <INPUT_FILE> --output_dir <OUTPUT_FILE>
```
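
Because the FActScore output feeds the HALoGEN eval, the two runs can be chained. The sketch below only assembles the two commands shown above, with the same script names and flags, and runs them in order. The helper names `eval_commands` and `run_evals` are hypothetical, and the sketch assumes it is started from the repository root with `python` on the path.

```python
import subprocess


def eval_commands(fs_input, fs_output, halogen_output):
    """Return (working directory, command) pairs in the required order.

    The FActScore result file is reused as the HALoGEN input.
    Relative file paths are resolved against each working directory.
    """
    factscore = ["python", "factscore_eval_run.py",
                 "--data_path", fs_input, "--result_path", fs_output]
    halogen = ["python", "bio_scorer.py",
               "--input_dir", fs_output, "--output_dir", halogen_output]
    return [("scripts/factscore_eval", factscore),
            ("scripts/halogen_eval", halogen)]


def run_evals(fs_input, fs_output, halogen_output):
    """Run FActScore first, then HALoGEN; stop at the first failure."""
    for workdir, command in eval_commands(fs_input, fs_output, halogen_output):
        subprocess.run(command, cwd=workdir, check=True)
```

Using absolute paths for the three files avoids confusion, since each command runs in a different directory.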

Postprocessing example:

Add the path of the output file from the previous step to the following script:

```
    python factuality_postprocess.py
```

Add the path of the output file from the previous step to the following script. It generates the results for Tables 1 and 2 in the main paper.

```
    python factuality_results.py
```

## Refusal-Based Tasks

### 1. Data

False presuppositions generation:
For our list of prompts, refer to the `scripts/halogen_eval/fp.json` file.

Scientific attributions generation:
For our list of prompts, refer to the `scripts/halogen_eval/refs.json` file.

Historical events generation:
For our list of prompts, refer to the `scripts/halogen_eval/he.json` file.

All generated responses that we synthesize from are included in the `data` directory.

### 2. Generate Samples / Generate Baselines

**Sample fp generation script for the 8-bit Qwen72B-Instruct model:** `scripts/sample_fp_gen.py`

**Sample LM consensus generation script:** `scripts/baseline_generation/cons.py`

Execution example:

```
  cd scripts/baseline_generation
  python cons.py --input <INPUT_FILE> --output <OUTPUT_FILE>
```

### 3. ConGrs Construction

**Construction script:** `scripts/test_poa_bios_batch.ipynb`

ConGrs construction saves the ConGrs as `.pkl` files. Set the path to your generated samples in the notebook.

### 4. Decoding

**Decoding script:** `scripts/test_poa_bios_batch_decode.ipynb`

This notebook performs response synthesis with consensus decoding.

- Provide the path to your saved ConGr `.pkl` files in the script.
- Specify the task code in the script:
  - `fp` → **False Presuppositions**
  - `refs` → **Scientific References**

### 5. Eval

**Sample HALoGEN hallucination generation input file:** `scripts/halogen_eval/sample_refusal_halc_input.json`

Add your Semantic Scholar and OpenAI API keys to the `config.yml` file.

Execution example:

```
  cd scripts/halogen_eval
  python evaluate_hallucinations.py --input_dir <INPUT_FILE> --output_dir <OUTPUT_FILE> --scientific_attribution
```

**Sample HALoGEN eval input file:** `scripts/halogen_eval/sample_refusal_input.json`

Execution example:

```
  cd scripts/halogen_eval
  python reference_scorer.py --input_dir <INPUT_FILE> --output_dir <OUTPUT_FILE>
```

Postprocessing example:

Add the path of the output file from the previous step to the following script. It generates the results for Table 5 in the main paper.

```
  python refusal_results.py
```

## Reasoning Tasks

### 1. Data

We use two benchmark datasets:

- [MATH Dataset](https://huggingface.co/datasets/nlile/hendrycks-MATH-benchmark/viewer/default/test)
- [AIME 2024 Dataset](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024)

### 2. Generate Samples

**Inference scripts:**

- MATH: `scripts/math_sample_generation.py`
- AIME: `scripts/aime_sample_generation.py`

Each script generates 5 samples per instance for every model. Set the path to the benchmark data in the scripts.

### 3. ConGrs Construction and Guided Self-Verification

**Construction and decoding script:** `scripts/process_math.ipynb`

This notebook performs ConGrs construction and response synthesis with guided self-verification. Set the path to your generated samples in the notebook.

### 4. Evaluation

**Evaluation script:** `scripts/matheval.py`

Measures accuracy across the dataset. Set the path to the final decoded responses file in the script.
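
For reference, dataset-level accuracy is the fraction of instances whose final answer matches the reference. The sketch below uses exact string match after trimming whitespace, which is an assumption made for illustration; `scripts/matheval.py` may apply a different notion of answer equivalence, and the function name `accuracy` is hypothetical.

```python
def accuracy(predictions, references):
    """Fraction of predictions equal to their reference answer.

    Comparison is exact string match after stripping whitespace
    (an illustrative assumption, not the repository's matching rule).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)
```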
