# Shifting Attention to Relevance (_SAR_)

### Environments

Set up the environment by installing the dependencies listed in `requirements.txt`.

### Data Preparation
```shell
cd src
sh parse_datasets.sh
```
This automatically parses the CoQA, Trivia QA, and SciQ datasets.

### Uncertainty Estimation for "off-the-shelf" LLMs (Table 1 in the manuscript)
To reproduce the results reported in Table 1 of the manuscript, run the scripts below.
#### for the CoQA dataset:
```shell
sh scripts/coqa/ue_pipeline_opt-2.7b.sh

sh scripts/coqa/ue_pipeline_opt-6.7b.sh

sh scripts/coqa/ue_pipeline_opt-13b.sh

sh scripts/coqa/ue_pipeline_opt-30b.sh

sh scripts/coqa/ue_pipeline_llama-7b.sh

sh scripts/coqa/ue_pipeline_llama-13b.sh
```

#### for the SciQ dataset:
```shell
sh scripts/sciq/ue_pipeline_opt-2.7b.sh

sh scripts/sciq/ue_pipeline_opt-6.7b.sh

sh scripts/sciq/ue_pipeline_opt-13b.sh

sh scripts/sciq/ue_pipeline_opt-30b.sh

sh scripts/sciq/ue_pipeline_llama-7b.sh

sh scripts/sciq/ue_pipeline_llama-13b.sh
```

#### for the Trivia QA dataset:
```shell
sh scripts/trivia_qa/ue_pipeline_llama-7b.sh

sh scripts/trivia_qa/ue_pipeline_llama-13b.sh
```
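
The scripts listed above can also be launched in sequence from a small driver. The following is a minimal sketch, not part of the repository: the `MODELS` table only mirrors the script names shown in this README, the helper names `pipeline_commands` and `run_all` are hypothetical, and the paths are assumed to be relative to the directory from which the commands above are run. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Model suffixes copied from the script names listed in this README.
MODELS = {
    "coqa": ["opt-2.7b", "opt-6.7b", "opt-13b", "opt-30b", "llama-7b", "llama-13b"],
    "sciq": ["opt-2.7b", "opt-6.7b", "opt-13b", "opt-30b", "llama-7b", "llama-13b"],
    "trivia_qa": ["llama-7b", "llama-13b"],
}


def pipeline_commands(dataset):
    """Return one `sh scripts/<dataset>/ue_pipeline_<model>.sh` command per model."""
    return [["sh", f"scripts/{dataset}/ue_pipeline_{model}.sh"] for model in MODELS[dataset]]


def run_all(dataset, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in pipeline_commands(dataset):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first script that exits with an error.
            subprocess.run(cmd, check=True)


run_all("trivia_qa")  # dry run: prints the two Trivia QA commands
```

Call `run_all("coqa", dry_run=False)` to actually execute the CoQA scripts one after another.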

Some LLMs, such as OPT-30b and LLaMA-13b, require a large amount of GPU memory. One option is to dispatch these large models
across multiple GPUs by specifying the `devices` option within each script file. Currently, multi-GPU dispatch is supported only for OPT-30b and LLaMA-13b.
For the smaller language models, we recommend using a single GPU with at least 48G of memory.
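
As a rough guide to which models may need more than one GPU, the memory taken by the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch under stated assumptions: the parameter counts are read off the model names, 2 bytes per parameter corresponds to half precision (whether the scripts load models this way is not stated here), and activations, the key-value cache, and sampled generations add to the total.

```python
# Approximate parameter counts, in billions, implied by the model names (assumption).
PARAMS_BILLIONS = {
    "opt-2.7b": 2.7,
    "opt-6.7b": 6.7,
    "opt-13b": 13,
    "opt-30b": 30,
    "llama-7b": 7,
    "llama-13b": 13,
}


def weight_memory_gb(params_billions, bytes_per_param=2):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


for name, size in PARAMS_BILLIONS.items():
    fp16 = weight_memory_gb(size)                      # half precision
    fp32 = weight_memory_gb(size, bytes_per_param=4)   # full precision
    print(f"{name}: ~{fp16:.1f} GB (fp16), ~{fp32:.1f} GB (fp32)")
```

Under these assumptions the OPT-30b weights alone exceed a 48G card in half precision, and a 13b model exceeds it in full precision.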

### Uncertainty Estimation for commercial LLMs (Table 2 in the manuscript)
`cd src/online_models`

Specify your OpenAI API key at line 197 of `online_model_eval.py`.
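
If you prefer not to hard-code the key in the source file, one common alternative is to read it from an environment variable. This is only a sketch: `online_model_eval.py` is not known to read any environment variable, the helper name `load_openai_api_key` and the variable name `OPENAI_API_KEY` are assumptions, and you would still need to assign the returned value at the place where the script expects the key.

```python
import os


def load_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Raises RuntimeError when the variable is unset or blank, so a missing
    key is reported before any request is sent.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the evaluation.")
    return key
```

Usage: run `export OPENAI_API_KEY=...` in your shell, then use `load_openai_api_key()` wherever the key string would otherwise be written.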

For example, to reproduce the Trivia QA results with text-davinci-002, execute the following steps:
1. Run `python3 online_model_eval.py --dataset trivia_qa --model text-davinci-002`

   This will:
   1. randomly select and parse 1,000 questions from the training split of the Trivia QA dataset;
   2. query text-davinci-002 for a response to each question;
   3. query text-davinci-002 for the similarity between each model generation and the real answers.
2. Get semantic clusters (only needed to compare with Semantic Entropy):

   `python3 get_semantic_clusters.py --generation-path trivia_qa-text-davinci-002_generations.pkl`
3. Evaluate correctness with Rouge-L and Sentence Similarity:

   `python3 correctness_eval.py --generation-path trivia_qa-text-davinci-002_generations.pkl`
4. Compute token-wise relevance:

   `python3 get_tokenwise_importance.py --generation-path trivia_qa-text-davinci-002_generations.pkl`
5. Compute sentence-wise relevance:

   `python3 get_sentence_similarities.py --generation-path trivia_qa-text-davinci-002_generations.pkl`
6. Report uncertainty in AUROC:

   `python3 compute_uncertainty.py --generation-path trivia_qa-text-davinci-002_generations.pkl`
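
The six steps above can be collected into an ordered command list, which makes it easier to rerun the pipeline for another dataset or model. This is a minimal sketch: the helper name `online_pipeline_commands` is hypothetical, and the file name pattern `<dataset>-<model>_generations.pkl` is inferred from the single example in this README, so verify it against the file that step 1 actually writes.

```python
import subprocess

# Scripts for steps 2 to 6, in the order given in this README.
POST_PROCESSING_SCRIPTS = (
    "get_semantic_clusters.py",
    "correctness_eval.py",
    "get_tokenwise_importance.py",
    "get_sentence_similarities.py",
    "compute_uncertainty.py",
)


def online_pipeline_commands(dataset="trivia_qa", model="text-davinci-002"):
    """Return the commands for steps 1 to 6 as argument lists."""
    # Assumed naming pattern, inferred from the example file name above.
    generations = f"{dataset}-{model}_generations.pkl"
    commands = [["python3", "online_model_eval.py", "--dataset", dataset, "--model", model]]
    for script in POST_PROCESSING_SCRIPTS:
        commands.append(["python3", script, "--generation-path", generations])
    return commands


def run_pipeline(dataset="trivia_qa", model="text-davinci-002"):
    """Run every step in order, stopping at the first failure."""
    for cmd in online_pipeline_commands(dataset, model):
        subprocess.run(cmd, check=True)
```

`run_pipeline()` must be called from `src/online_models`, and step 1 sends requests to the OpenAI API, so it is left uncalled here.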

When evaluating with `compute_uncertainty.py`, you may specify any metrics and thresholds you like. 
For example, to reproduce our results over the Trivia QA dataset:
```shell
python3 compute_uncertainty.py --generation-path trivia_qa-text-davinci-002_generations.pkl \
--threshold 0.5 --metrics rougeL_to_target sentsim similarity --method 
```
By default, it evaluates uncertainty with the following methods: `token-sar`, `sentence-sar`, `sar`, `predictive-entropy`, `len-normed-predictive-entropy`, `semantic-entropy`, `lexical-similarity`.
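
To make the role of the threshold and the AUROC concrete, the sketch below shows one plausible way the two fit together. It is an illustration under assumptions, not the code in `compute_uncertainty.py`: a generation is treated as correct when its correctness score (for example Rouge-L against the target) is above the threshold, and the AUROC measures how well the uncertainty score ranks incorrect generations above correct ones. The function names and the toy numbers are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def uncertainty_auroc(uncertainties, correctness_scores, threshold=0.5):
    """AUROC of uncertainty as a detector of incorrect generations."""
    # Assumption: a generation counts as correct only if its score exceeds the threshold.
    incorrect = [score <= threshold for score in correctness_scores]
    return auroc(uncertainties, incorrect)


# Toy data: four generations with an uncertainty score and a Rouge-L score each.
uncertainties = [0.2, 0.9, 0.4, 0.7]
rouge_l = [0.8, 0.1, 0.6, 0.3]
print(uncertainty_auroc(uncertainties, rouge_l, threshold=0.5))
```

In this toy example both incorrect generations have higher uncertainty than both correct ones, so the ranking is perfect; a value near 0.5 would mean the uncertainty score is uninformative.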
