# Training-free LLM Verification via Recycling Few-shot Examples
This repository provides the code for the paper [Training-free LLM Verification via Recycling Few-shot Examples], along with the model responses and data needed to reproduce the experiments.


## Environment Setup

1. **Create a new conda environment with Python 3.10**
   ```bash
   conda create -n my_env python=3.10
   ```

2. **Activate the environment and install dependencies**
   ```bash
   conda activate my_env
   pip install -r requirements.txt
   ```

3. **Install LaTeX-to-SymPy converter**
   ```bash
   cd math500/latex2sympy2
   pip install -e .
   ```
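
To confirm the setup before moving on, a quick import check can help. The snippet below is a minimal sketch rather than part of the repository: `missing_modules` is a hypothetical helper, and the module name `latex2sympy2` is an assumption taken from the directory name above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module name, based on the `latex2sympy2` directory installed above.
    missing = missing_modules(["latex2sympy2"])
    print("missing:", missing if missing else "none")
```

If a module is reported as missing, re-run the corresponding install step inside the activated environment.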

## Step-by-Step Workflow
### Notes
Most shell scripts in `sh/` (e.g., `likelihood_all_gpt.sh`, `response_all_gpt.sh`) include `#SBATCH` directives and are meant to be submitted to a Slurm scheduler (for example, with `sbatch`).

We ran our experiments primarily on an A6000 GPU.

### Step 1: Generate Model Outputs
- Open and run `generate.ipynb`.

### Step 2: Compute Likelihoods
- Run the likelihood script:
  ```bash
  ./sh/likelihood_all_gpt.sh
  ```
- Ensure you have set the correct `model_name`, `input_dir`, and `output_dir` variables in the script.
- Running this script creates an `all_likelihoods.json` file in `output_dir`.
- This file is used to calculate the `backward consistency` score.
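
Before passing the file on to later steps, it can be useful to confirm that it was written and parses cleanly. The sketch below makes no assumption about the internal schema of `all_likelihoods.json`; it only reports the top-level container. `summarize_json` is a hypothetical helper, not something the repository provides.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level container."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        # Show only a few keys so large files stay readable.
        return {"type": "dict", "size": len(data), "sample_keys": list(data)[:5]}
    if isinstance(data, list):
        return {"type": "list", "size": len(data)}
    return {"type": type(data).__name__}
```

For example, `summarize_json("output_dir/all_likelihoods.json")`, where the path stands in for whatever `output_dir` you configured.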

### Step 3: Compute Baselines
- Execute the response-based baseline script:
  ```bash
  ./sh/response_all_gpt.sh
  ```
- Verify the same configuration variables (`model_name`, `input_dir`, `output_dir`) in this script as well.
- Running this script creates a `{task}_few_few.jsonl` file in `output_dir`.
- This file is used to calculate the `forward confidence` score.

### Step 4: Update Correctness Annotations
- After generating model responses in **Step 1**, run the helper function provided in each task notebook (e.g., `math500/math500.ipynb`).  
- This will create a `{task}_{shot}_scored.jsonl` file in the `result` folder.  
- The scored file includes the model outputs along with the `"is_correct"` key, which is required for subsequent scoring.
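
As a sanity check on a scored file, the `"is_correct"` key can be used to compute plain accuracy. This is a minimal sketch, assuming one JSON object per line and a boolean-like `"is_correct"` value; `accuracy_from_jsonl` is a hypothetical helper and not the function shipped in the task notebooks.

```python
import json


def accuracy_from_jsonl(path):
    """Return the fraction of records whose "is_correct" value is truthy."""
    total = 0
    correct = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            total += 1
            # Assumption: "is_correct" is a bool (or 0/1) on every record.
            correct += bool(record["is_correct"])
    if total == 0:
        raise ValueError(f"no records found in {path}")
    return correct / total
```

A missing `"is_correct"` key raises `KeyError`, which is a useful signal that the file has not been through this step yet.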

### Step 5: Apply ReFeri Methods
- Use `check.ipynb` to run our method on the updated likelihood data.
- You can inspect `forward_score`, `backward_score`, and `referi`, the last of which is our final score.

## Baselines Directory
- The `baselines/` folder contains the following:
  - `response_likelihood_gpt.py`: Calculates the forward (direct) score required by our metric.

