# ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models
This is the official repository for our paper "ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models".


# Experiments
1. [Installation](#installation)
2. [Datasets](#datasets)
3. [Model Response Generation and Evaluation](#generation)
4. [Run ESI and Baselines](#esi)
5. [Other](#other)

## Installation
1. Create and activate a Python environment:
    ```bash
    conda create -n esi python=3.11
    conda activate esi
    ```
2. Install PyTorch:
    ```bash
    pip3 install torch
    ```
3. Install the remaining dependencies:
    ```bash
    pip install -r requirements.txt
    ```
4. The BEM correctness metric requires TensorFlow. If a cuDNN version error is reported when running BEM, update cuDNN as follows:
    ```bash
    conda install -c conda-forge cudnn=9.3
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
    ```

**Other models and files needed:**

- BEM model (used for correctness evaluation). Download the model from [TensorFlow Hub](https://tfhub.dev/google/answer_equivalence/bem/1) and save it to the directory **data/model/** (or change the default cache path **CACHED_BEM_PATH** in **uncertainty/generation_evaluation/metrics/bem.py**, line 40).

- Other required models can be downloaded directly from Hugging Face. You can add or replace a model by modifying the `LLM_MODEL_CONFIG` attribute of the `LLM` class in **uncertainty/utils/llm.py**:
  
    ```text
    class LLM
     └───LLM_MODEL_CONFIG
         └───[the name used to refer to this model]
             │───model_name: [the name used to refer to this model]
             │───model_path: [path passed to .from_pretrained in Hugging Face to load the model]
             │───model_class: [the model class on which .from_pretrained is called, such as 'AutoModel']
             │───fp16: [whether to use half precision]
             │───tokenizer_path: [path passed to .from_pretrained in Hugging Face to load the tokenizer]
             └───tokenizer_class: [the tokenizer class on which .from_pretrained is called]
    ```
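As an illustration, a new entry might look like the sketch below. The key `my-llama`, the path `your-org/your-model`, and the use of strings and a boolean as value types are assumptions derived from the field descriptions above; check the existing entries in **uncertainty/utils/llm.py** for the exact format before copying this.

```python
# Field names taken from the LLM_MODEL_CONFIG layout described above.
REQUIRED_FIELDS = (
    "model_name",
    "model_path",
    "model_class",
    "fp16",
    "tokenizer_path",
    "tokenizer_class",
)

# Hypothetical entry: the name and paths are placeholders, not values from the repository.
new_model_entry = {
    "my-llama": {
        "model_name": "my-llama",
        "model_path": "your-org/your-model",
        "model_class": "AutoModelForCausalLM",
        "fp16": True,
        "tokenizer_path": "your-org/your-model",
        "tokenizer_class": "AutoTokenizer",
    }
}


def missing_fields(entry):
    """Return the documented fields that are absent from a config entry."""
    return [field for field in REQUIRED_FIELDS if field not in entry]


print(missing_fields(new_model_entry["my-llama"]))
```

The key chosen here (`my-llama`) is the value that would later be passed as the model name on the command line.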
  
  
## Datasets
Download and process the datasets with the following command:
```bash
bash process_datasets.bash
```

## Generation
To generate answers for a QA dataset and evaluate their correctness, run the following command:
```bash
python run_generation.py -c config/${dataset}_config.yaml -dp data/datasets/${dataset}/test.jsonl -o output/cached_results/${dataset} -m ${model} -b 20
```

Replace `${dataset}` with the name of the dataset you want to run; supported names are the directory names in **data/datasets**. Replace `${model}` with the name of the base model whose uncertainty you want to estimate; it must match one of the keys defined in the `LLM_MODEL_CONFIG` attribute of the `LLM` class.
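
To run generation for several datasets in a row, the command can be assembled programmatically. The sketch below only builds and prints the command lines; the helper name, the dataset names `dataset_a` and `dataset_b`, and the model key `my-llama` are placeholders, not names from this repository.

```python
import shlex


def build_generation_command(dataset, model, batch_size=20):
    """Build the run_generation.py command shown above as an argument list."""
    return [
        "python", "run_generation.py",
        "-c", f"config/{dataset}_config.yaml",
        "-dp", f"data/datasets/{dataset}/test.jsonl",
        "-o", f"output/cached_results/{dataset}",
        "-m", model,
        "-b", str(batch_size),
    ]


# Placeholder dataset names; use the directory names found in data/datasets.
for dataset in ["dataset_a", "dataset_b"]:
    print(shlex.join(build_generation_command(dataset, "my-llama")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute the command instead of printing it.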




## ESI
### Run ESI (SOC) and Baselines
Evaluate the performance of ESI (SOC) and the baselines as follows:
```bash
python run_estimation.py -c ${path_save_the_generation_results} -o ${directory_to_save_the_results} -ot ${directory_to_save_the_results} -b 30 -n 10 -f skip -p 0.3 --sim_batch_size 512 --num_scores_returned 100 --sar_save_path ${directory_to_save_the_results}  --se_save_path ${directory_to_save_the_results} --inside_save_path ${directory_to_save_the_results} --mi_save_path ${directory_to_save_the_results} --sar_batch_size 10 --se_batch_size 10 --inside_batch_size 10 --mi_batch_size 10
```
`${path_save_the_generation_results}` is the path to the file `results.json`, which stores all results from the generation and evaluation step.
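
The command above passes the same output directory to six different flags, which is easy to get wrong when editing by hand. A small helper like the following sketch can assemble it; the function name is hypothetical, and the flag values simply mirror the command shown above.

```python
BASELINES = ("sar", "se", "inside", "mi")


def build_soc_estimation_command(results_json, out_dir):
    """Assemble the ESI (SOC) + baselines command with one shared output directory."""
    cmd = [
        "python", "run_estimation.py",
        "-c", results_json,
        "-o", out_dir,
        "-ot", out_dir,
        "-b", "30", "-n", "10", "-f", "skip", "-p", "0.3",
        "--sim_batch_size", "512",
        "--num_scores_returned", "100",
    ]
    # Every baseline writes to the same directory, as in the command above.
    for name in BASELINES:
        cmd += [f"--{name}_save_path", out_dir]
    # Every baseline uses a batch size of 10, as in the command above.
    for name in BASELINES:
        cmd += [f"--{name}_batch_size", "10"]
    return cmd


print(" ".join(build_soc_estimation_command("path/to/results.json", "path/to/output_dir")))
```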

### Run ESI (PARA) and Baselines
First, generate the paraphrases using DeepSeek:
```bash
python run_prompt_paraphrase.py -d ${dataset} -a deepseek -o data/datasets/${dataset}/paraphrase.jsonl
```
Then run ESI (PARA) and the baselines:
```bash
python run_estimation.py -c ${path_save_the_generation_results} -o ${directory_to_save_the_results} -ot ${directory_to_save_the_results} -b 30 -n 5 -f paraphrase --paraphrase_result_path data/datasets/${dataset}/paraphrase.jsonl --sim_batch_size 512 --num_scores_returned 100 --sar_save_path ${directory_to_save_the_results}  --se_save_path ${directory_to_save_the_results} --inside_save_path ${directory_to_save_the_results} --mi_save_path ${directory_to_save_the_results} --sar_batch_size 10 --se_batch_size 10 --inside_batch_size 10 --mi_batch_size 10
```
### Run ESI Only
To evaluate ESI (SOC) only, add `--esi_only` as follows:
```bash
python run_estimation.py -c ${path_save_the_generation_results} -o ${directory_to_save_the_results} -ot ${directory_to_save_the_results} -b 30 -n 10 -f skip -p 0.3 --num_scores_returned 100 --esi_only
```
### Run a Particular Baseline Only
To evaluate a single baseline only, add `--evaluate_method ${baseline_name}` as follows:
```bash
python run_estimation.py -c ${path_save_the_generation_results} -o ${directory_to_save_the_results} --evaluate_method ${baseline_name}
```

## Other
- When generating paraphrases for ESI (PARA), add your DeepSeek API key to `DEEPSEEK_API_KEY` in **./run_prompt_paraphrase.py** (line 15).
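
If you would rather not store the key in the source file, one option is to read it from an environment variable. This is a sketch of an alternative, not what the script currently does; the helper name and the choice of reusing `DEEPSEEK_API_KEY` as the environment variable name are assumptions.

```python
import os


def load_deepseek_api_key(env_var="DEEPSEEK_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before generating paraphrases."
        )
    return key
```

With this approach, line 15 of the script could assign `DEEPSEEK_API_KEY = load_deepseek_api_key()` after running `export DEEPSEEK_API_KEY=...` in the shell.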




