## Installation

Install the required Python libraries by running the command below.
```bash
pip install -r requirements.txt
```

Please download the model weights used in our experiments from [mistralai/Mistral-7B-Instruct-v0.1 on Hugging Face](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) and place them in the `./model` directory.

## Retriever Setup

All code related to the retriever setup is in the `./code/retrievers` directory. We provide the two retrieval services
reported in our paper:

1. **BM25** Retrieval Service using Elasticsearch
2. **BGE** Retrieval Service using FAISS

### Downloads:
1. 2018 English Wikipedia Corpus: `wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz`
2. BGE Embedding Model Weights: https://huggingface.co/BAAI/bge-large-en-v1.5

### Dependencies:
- FAISS: https://github.com/facebookresearch/faiss or https://pypi.org/project/faiss/
- SentenceTransformers: https://github.com/UKPLab/sentence-transformers
- Flask
- PyTorch
- Elasticsearch

### Quick Start to set up **BGE** Retrieval Service

1. Encode the Wikipedia passages into embeddings by running `python encode_wiki_bge.py`.
2. Run `python bge_faiss.py` to start the BGE retrieval service.
3. Sample code to call the BGE retrieval service: `python send_req_bge_wiki.py -q query -k top_k`
    - `--use_prefix` prepends the prefix `Represent this sentence for searching relevant passages:` to each query, so that queries and passages are encoded asymmetrically.
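
The effect of `--use_prefix` can be sketched as follows. This is a minimal illustration, not the code in `send_req_bge_wiki.py`: the function name `build_query` is hypothetical, and joining the prefix and the query with a single space is an assumption about how the two strings are combined.

```python
# Instruction prefix quoted in this README for BGE query encoding.
QUERY_PREFIX = "Represent this sentence for searching relevant passages:"


def build_query(query: str, use_prefix: bool = False) -> str:
    """Return the text to encode for a query (hypothetical helper).

    Passages are encoded unchanged; only queries receive the prefix.
    """
    query = query.strip()
    if use_prefix:
        # Assumption: prefix and query are separated by one space.
        return f"{QUERY_PREFIX} {query}"
    return query


if __name__ == "__main__":
    print(build_query("who wrote Hamlet", use_prefix=True))
```

Because only the query side changes, the passage embeddings do not need to be recomputed when the flag is toggled.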

### Quick Start to set up ES (Elasticsearch) Retrieval Service (**BM25**)

1. Run `python es_dictionary.py` to convert the TSV passages into the required dictionary format.
2. Run `python es_service.py` to start the Elasticsearch retrieval service.
3. Sample code to call the ES retrieval service: `python send_es_req.py -q query -k top_k`
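
The conversion in step 1 can be sketched as below. The sketch assumes the corpus is a tab-separated file with a header row and the columns `id`, `text`, and `title`, as in the DPR `psgs_w100.tsv` release. The output keys and the function name `read_passages` are hypothetical and may differ from what `es_dictionary.py` produces.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def read_passages(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dictionary per passage from a DPR-style TSV stream."""
    # DictReader takes the column names from the header row.
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Assumed target layout: one flat dictionary per passage.
        yield {"id": row["id"], "title": row["title"], "text": row["text"]}


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real corpus file.
    sample = 'id\ttext\ttitle\n1\t"Aaron is a ""prophet""."\tAaron\n'
    for passage in read_passages(io.StringIO(sample)):
        print(passage)
```

For the compressed download, the same function can read from `gzip.open(path, "rt", encoding="utf-8", newline="")`, which streams the file without unpacking it first.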

After deploying the retrieval service, please complete the corresponding retrieval functions in `./code/retrieval.py`.
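
A retrieval function in `./code/retrieval.py` could look roughly like the sketch below. Everything about the wire format is an assumption: the URL `http://localhost:5000/search`, the JSON fields `query` and `k`, and the `passages` key in the response are hypothetical placeholders, so adapt them to what your deployed service actually accepts and returns.

```python
import json
import urllib.request
from typing import Callable, List

# Hypothetical endpoint; replace with the address of your deployed service.
SERVICE_URL = "http://localhost:5000/search"


def build_payload(query: str, top_k: int) -> bytes:
    """Serialize the request body (field names are assumptions)."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return json.dumps({"query": query, "k": top_k}).encode("utf-8")


def post_json(url: str, body: bytes, timeout: float = 30.0) -> bytes:
    """POST a JSON body with the standard library and return the raw reply."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def retrieve(
    query: str,
    top_k: int = 5,
    send: Callable[[str, bytes], bytes] = post_json,
) -> List[str]:
    """Return up to top_k passages; `send` can be replaced in tests."""
    reply = json.loads(send(SERVICE_URL, build_payload(query, top_k)))
    # Assumption: the service answers with {"passages": [...]}.
    return list(reply.get("passages", []))[:top_k]
```

Passing the transport in as `send` keeps the function testable without a running service, which helps when the BM25 and BGE back ends are deployed on different machines.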

## Evaluation

All evaluation commands can be found in `./run.sh`.

### TriviaQA

```bash
#!/bin/bash
SCRIPT_PATH="run.py"

# Constant parameters
CONFIG="configs/run.json"
MODEL="run_short_form"
DATASET="triviaqa"
TASK="triviaqa"
MAX_NEW_TOKENS=1024
METRIC="match"

# triviaqa
python $SCRIPT_PATH \
--config $CONFIG \
--model $MODEL \
--dataset $DATASET \
--task $TASK \
--max_new_tokens $MAX_NEW_TOKENS \
--retrieve_method "bge_serper" \
--metric $METRIC \
--use_tvq
```
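
This README does not define the `match` metric used for the short-form tasks. A common reading, assumed here, is that a prediction counts as correct when any gold answer appears in it as a substring after light normalization. The sketch below illustrates that reading only; the function names are hypothetical and the normalization may differ from what `run.py` does.

```python
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def match(prediction: str, gold_answers: Iterable[str]) -> int:
    """Return 1 if any non-empty gold answer occurs in the prediction, else 0."""
    pred = normalize(prediction)
    return int(any(normalize(g) in pred for g in gold_answers if g.strip()))


def match_accuracy(predictions: List[str], golds: List[List[str]]) -> float:
    """Average the per-example match scores over a dataset."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not predictions:
        return 0.0
    return sum(match(p, g) for p, g in zip(predictions, golds)) / len(predictions)


if __name__ == "__main__":
    print(match_accuracy(["It is Paris.", "Rome"], [["paris"], ["Madrid"]]))
```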

### PopQA

```bash
#!/bin/bash
SCRIPT_PATH="run.py"

# Constant parameters
CONFIG="configs/run.json"
MODEL="run_short_form"
DATASET="popqa"
TASK="popqa"
MAX_NEW_TOKENS=1024
METRIC="match"

python $SCRIPT_PATH \
--config $CONFIG \
--model $MODEL \
--dataset $DATASET \
--task $TASK \
--max_new_tokens $MAX_NEW_TOKENS \
--retrieve_method "bge_serper" \
--metric $METRIC \
--use_tvq \
--continue_gen_without_contents
```

### ASQA

```bash
#!/bin/bash
SCRIPT_PATH="run.py"

# Constant parameters
CONFIG="configs/run.json"
MODEL="run_long_form"
DATASET="asqa"
TASK="asqa"
MAX_NEW_TOKENS=130

python $SCRIPT_PATH \
--config $CONFIG \
--model $MODEL \
--dataset $DATASET \
--task $TASK \
--max_new_tokens $MAX_NEW_TOKENS \
--retrieve_method "bge" \
--use_tvq
```

[ALCE/ASQA](https://github.com/princeton-nlp/ALCE) evaluates long-form QA with several metrics. To run the evaluation, first clone the ALCE repository, create a virtual environment, and download the required data.
```bash
git clone https://github.com/princeton-nlp/ALCE.git
# Create and activate a virtual environment named alce_env
python3 -m venv alce_env
source alce_env/bin/activate
cd ALCE
bash download_data.sh
```

### Bio Generation

```bash
#!/bin/bash
SCRIPT_PATH="run.py"

# Constant parameters
CONFIG="configs/run.json"
MODEL="run_long_form"
DATASET="fact"
TASK="fact"
MAX_NEW_TOKENS=300

python $SCRIPT_PATH \
--config $CONFIG \
--model $MODEL \
--dataset $DATASET \
--task $TASK \
--max_new_tokens $MAX_NEW_TOKENS \
--retrieve_method "bge_serper" \
--use_tvq
```

Please follow the instructions in the official [FActScore](https://github.com/shmsw25/FActScore) repository to set up your environment.

Then score your output file with the command below:
```bash
python -m factscore.factscorer \
--data_path YOUR_OUTPUT_FILE \
--model_name retrieval+ChatGPT \
--cache_dir YOUR_CACHE_DIR \
--openai_key YOUR_OPEN_AI_KEY \
--verbose
```

### FreshQA

```bash
#!/bin/bash
SCRIPT_PATH="run.py"

# Constant parameters
CONFIG="configs/run.json"
MODEL="run_long_form"
DATASET="fresh"
TASK="fresh"
MAX_NEW_TOKENS=1024

python $SCRIPT_PATH \
--config $CONFIG \
--model $MODEL \
--dataset $DATASET \
--task $TASK \
--max_new_tokens $MAX_NEW_TOKENS \
--retrieve_method "serper" \
--use_tvq
```

To run the evaluation, please follow the instructions in the [freshllms/freshqa](https://github.com/freshllms/freshqa) repository, which provides the data and code for FreshLLMs ([arXiv:2310.03214](https://arxiv.org/abs/2310.03214)).
