# Supplementary Materials for 'Large language models are not zero-shot communicators'
Below is a brief description of the contents of this folder. Broadly, it contains the data, the human annotations, the prompt templates, a preview of the raw results, and the code used for the evaluations.

## The data
The test set can be found at `data/test_conversational_implicatures.csv` and the dev set at `data/dev_conversational_implicatures.csv`.
The data is given as (utterance, response, implicature) tuples, which can be wrapped in the prompt templates with the code, as detailed below.
The type labels for the error analysis are in `data/type_labels.csv`.
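
The tuple format can be illustrated with a short parsing sketch. The column names `utterance`, `response`, and `implicature` mirror the keys shown in the results preview; that the CSV headers use exactly these names is an assumption, and the inline sample stands in for the real file. `ImplicatureExample` and `read_examples` are hypothetical helpers, not part of the released code.

```python
import csv
import io
from typing import NamedTuple


class ImplicatureExample(NamedTuple):
    """One (utterance, response, implicature) tuple."""
    utterance: str
    response: str
    implicature: str


def read_examples(handle) -> list[ImplicatureExample]:
    """Parse rows from an open CSV file into tuples (assumed column names)."""
    reader = csv.DictReader(handle)
    return [
        ImplicatureExample(row["utterance"], row["response"], row["implicature"])
        for row in reader
    ]


# Inline stand-in for data/test_conversational_implicatures.csv.
sample_csv = io.StringIO(
    "utterance,response,implicature\n"
    "Have you found him yet?,We're still looking.,no\n"
    "Is Marci grumpy?,he's as gentle as a lamb,no\n"
)
examples = read_examples(sample_csv)
print(examples[0].utterance, "->", examples[0].implicature)
```

To read the real file, the same function could be called on `open("data/test_conversational_implicatures.csv", newline="")`, provided the headers match.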

## The human evaluation
As described in Appendix E of the paper, the human evaluation is done by dividing the data into four subsets and having five annotators
annotate each of the 150 examples in a subset (together giving 600 examples, each annotated by five unique annotators).
The subset files with annotations can be found in the folder `human evaluation`.

## The prompt templates
The prompt templates used for the zero-shot and few-shot evaluations can be found in `data/prompt_templates.csv`. These
are the six main prompt templates. The additional prompt templates used for the extra zero-shot experiment are in the file
`data/alignment_prompt_templates.csv`. Examples can be wrapped in prompt templates with the code as detailed below.
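
As a rough sketch of what wrapping an example in a template amounts to, the snippet below fills a template string with one example. The template text is copied from "Prompt variation 0" in the example stdout further down in this README; the `{placeholder}` syntax and the `wrap_example` helper are assumptions made for illustration and may differ from how the templates are stored in the CSV files.

```python
# Template text taken from "Prompt variation 0" in the example stdout;
# the {placeholder} syntax is an assumption made for this sketch.
TEMPLATE = (
    "Does the following response to the question imply yes or no?\n\n"
    "question: {utterance}\n"
    "response: {response}\n"
    "implicature: {implicature}"
)


def wrap_example(template: str, utterance: str, response: str, implicature: str) -> str:
    """Fill a prompt template with one (utterance, response, implicature) tuple."""
    return template.format(
        utterance=utterance, response=response, implicature=implicature
    )


prompt = wrap_example(TEMPLATE, "Have you found him yet?", "We're still looking.", "no")
print(prompt)
```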

## All results
The file `results/preview_results.json` contains a preview of the grouped results, produced by running the head command
on the full results file: `head -50000 all_results.json`. The full results file (`all_results.json`) is too big to add
to the supplementary materials, but will be released after the anonymity period is over. For now, the preview can be used
to check the predictions and the numbers reported in the paper. For example, the results for column `k = 0` of OpenAI-175b (text-davinci-001) in Table 19 in the paper
are at the top of the file:
```json
"openai-davinci1": {
            "mean_accuracy": 72.30555555555556,
            "std": 2.8274721511328975,
            "template_results": {
                "prompt_template_1": 76.5,
                "prompt_template_2": 72.0,
                "prompt_template_3": 74.83333333333333,
                "prompt_template_4": 68.0,
                "prompt_template_5": 72.5,
                "prompt_template_6": 70.0
            }
```
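
The summary fields can be cross-checked against the per-template accuracies. The sketch below recomputes them from the values shown above. That `std` is the population standard deviation over the six templates is an inference from the numbers, not something stated by the released code.

```python
from statistics import mean, pstdev

# Per-template accuracies copied from the preview above.
template_results = {
    "prompt_template_1": 76.5,
    "prompt_template_2": 72.0,
    "prompt_template_3": 74.83333333333333,
    "prompt_template_4": 68.0,
    "prompt_template_5": 72.5,
    "prompt_template_6": 70.0,
}

accuracies = list(template_results.values())
mean_accuracy = mean(accuracies)
# Population standard deviation (divides by N, not N - 1).
std = pstdev(accuracies)

print(f"mean_accuracy={mean_accuracy:.4f} std={std:.4f}")
```

Both values agree with the `mean_accuracy` and `std` fields in the preview up to floating-point rounding.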
Directly below, the predictions per example per template are shown. For example, the first one:
```json
"prompt_template_1": {
                    "0": {
                        "id": 0,
                        "original_example": {
                            "source": "",
                            "type": "no",
                            "utterance": "Is Marci grumpy?",
                            "response": "he's as gentle as a lamb",
                            "implicature": "no"
                        },
                        "true": "no",
                        "pred": "no",
                        "correct": 1,
                        "prompt_examples": []
                    }
```
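
A per-template accuracy can presumably be derived from the `correct` flags of these entries. The sketch below shows one way to do so; `template_accuracy` is a hypothetical helper, and the second entry in the sample is made up purely to illustrate an incorrect prediction (only the first entry comes from the preview).

```python
def template_accuracy(predictions: dict) -> float:
    """Percentage of entries whose `correct` field equals 1."""
    flags = [entry["correct"] for entry in predictions.values()]
    return 100.0 * sum(flags) / len(flags)


sample_predictions = {
    # Entry "0" as shown in the preview (abbreviated).
    "0": {"id": 0, "true": "no", "pred": "no", "correct": 1},
    # Made-up entry, for illustration only.
    "1": {"id": 1, "true": "yes", "pred": "no", "correct": 0},
}

accuracy = template_accuracy(sample_predictions)
print(f"{accuracy:.1f}")
```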

We have separate results per model, but these will be released after the anonymity period is over.
This is because all the separate files contain a personal Hugging Face dataset identifier that would make the authors identifiable.
All results, plots, and tables can be produced from the `all_results.json` file.

## Running evaluations with the code

Before running, go over the "Check installation" section below.

The exact evaluations done for the paper cannot easily be reproduced without:
(1) having access to OpenAI or Cohere credits,
(2) having access to enough compute to run the large open source models.

### Check installation

Make sure the directory `code` is your current working directory.

Developed with Python 3.9.10, so make a virtual environment with this version.

Rust is a dependency of the `transformers` library. Install the compiler with:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

On a Mac with `arm64`, the following requirements for `sentencepiece` also need to be installed:

```bash
brew install cmake
brew install gperftools
brew install pkg-config
```

Python requirements:

```bash
python -m pip install -r requirements.txt
```

To run tests for all the code:

```bash
pytest tests/.
```

To run an end-to-end code check:

```bash
chmod a+x test_code_runs.sh
./test_code_runs.sh
```

Expected output should be:

```bash
This script should not take more than a minute to run.
PASSED
```

Note that to run any evaluations you need OpenAI and Cohere API keys. Add these keys
to the two files in the folder `static` called `cohere_api_key.txt` and `openai_api_key.txt`. The former needs just
a single line with the key; the latter has the organization key on the first line and the API key on the second.
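
The two file layouts described above can be sketched as follows. The helpers `read_cohere_key` and `read_openai_credentials` are hypothetical and not the repository's own loading code; the demo writes placeholder values to a temporary directory that stands in for `static`.

```python
import tempfile
from pathlib import Path


def read_cohere_key(path: Path) -> str:
    """Return the key stored on the single line of the Cohere key file."""
    return path.read_text().strip()


def read_openai_credentials(path: Path) -> tuple[str, str]:
    """Return (organization key, API key) from the two-line OpenAI key file."""
    lines = [line.strip() for line in path.read_text().splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected 2 non-empty lines in {path}, found {len(lines)}")
    organization, api_key = lines
    return organization, api_key


# Demo with placeholder values in a temporary stand-in for the `static` folder.
with tempfile.TemporaryDirectory() as tmp:
    static = Path(tmp)
    (static / "cohere_api_key.txt").write_text("<COHERE KEY>\n")
    (static / "openai_api_key.txt").write_text("<ORGANIZATION KEY>\n<API KEY>\n")
    cohere_key = read_cohere_key(static / "cohere_api_key.txt")
    organization, openai_key = read_openai_credentials(static / "openai_api_key.txt")

print(organization, openai_key, cohere_key)
```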


### Smaller open source models
The open source models that are small enough to run on your local machine can be run as follows.

For example, to run 100 examples through GPT-2 medium, run the following command:
```bash
python -m src.probe_llm +experiment=particularised ++model_ids=gpt2-medium ++objectives=lm ++max_num_evaluations=100
```
This produces the results in a folder named `results`. The folder also contains the exact text passed through the language model: each example wrapped in the templates, in both its coherent and its incoherent form.
The command should produce the following stdout:

<details>
<summary>stdout</summary>
<br>

```bash
[2022-09-25 13:40:21,437][root][INFO] - Logging used config:
[2022-09-25 13:40:21,437][root][INFO] - --------------------------------------------------
[2022-09-25 13:40:21,437][root][INFO] - type_implicature: particularised
[2022-09-25 13:40:21,437][root][INFO] - test_input_data_path: data/test_conversational_implicatures.csv
[2022-09-25 13:40:21,437][root][INFO] - test_control_input_data_path: data/test_conversational_implicatures.csv
[2022-09-25 13:40:21,437][root][INFO] - dev_input_data_path: data/dev_conversational_implicatures.csv
[2022-09-25 13:40:21,437][root][INFO] - dev_control_input_data_path: data/dev_conversational_implicatures.csv
[2022-09-25 13:40:21,437][root][INFO] - use_degenerate_dataset: False
[2022-09-25 13:40:21,437][root][INFO] - get_prediction_bias: True
[2022-09-25 13:40:21,437][root][INFO] - batch_size: 16
[2022-09-25 13:40:21,437][root][INFO] - seed: 0
[2022-09-25 13:40:21,437][root][INFO] - prompt_file: data/prompt_templates.txt
[2022-09-25 13:40:21,437][root][INFO] - logging_frequency: 100
[2022-09-25 13:40:21,437][root][INFO] - device: cpu
[2022-09-25 13:40:21,437][root][INFO] - results_folder: results
[2022-09-25 13:40:21,437][root][INFO] - max_num_evaluations: 100
[2022-09-25 13:40:21,437][root][INFO] - skip_until_idx: -1
[2022-09-25 13:40:21,437][root][INFO] - model_ids: gpt2-medium
[2022-09-25 13:40:21,438][root][INFO] - objectives: lm
[2022-09-25 13:40:21,438][root][INFO] - task: ranking
[2022-09-25 13:40:21,438][root][INFO] - k_shot: 0
[2022-09-25 13:40:21,438][root][INFO] - prompt_class: ExamplePrompt
[2022-09-25 13:40:21,438][root][INFO] - --------------------------------------------------
[2022-09-25 13:40:21,442][root][INFO] - Prompt variation 0:
[2022-09-25 13:40:21,442][root][INFO] - 
Does the following response to the question imply yes or no?

question: Have you found him yet?
response: We're still looking.
implicature: no
[2022-09-25 13:40:21,442][root][INFO] - Prompt variation 1:
[2022-09-25 13:40:21,442][root][INFO] - 
Finish the following text:

Esther asked "Have you found him yet?" and Juan responded "We're still looking.", which means no
[2022-09-25 13:40:21,442][root][INFO] - Prompt variation 2:
[2022-09-25 13:40:21,442][root][INFO] - 
Is the implied meaning of the following response yes or no:

question: Have you found him yet?
response: We're still looking.
meaning: no
[2022-09-25 13:40:21,442][root][INFO] - Prompt variation 3:
[2022-09-25 13:40:21,442][root][INFO] - 
What is the intent of the following response, yes or no?

question: Have you found him yet?
response: We're still looking.
intent: no
[2022-09-25 13:40:21,442][root][INFO] - Prompt variation 4:
[2022-09-25 13:40:21,442][root][INFO] - 
Finish the following text:

Karen asked "Have you found him yet?" and William responded "We're still looking.", which means no
[2022-09-25 13:40:21,442][root][INFO] - Prompt variation 5:
[2022-09-25 13:40:21,442][root][INFO] - 
Finish the following text:

Bob asked "Have you found him yet?" and Alice responded "We're still looking.", which means no
Using mask_token, but it is not set yet.
Using sep_token, but it is not set yet.
Using batch support for gpt2-medium
[2022-09-25 13:40:31,034][root][INFO] - Processing Model: gpt2-medium
[2022-09-25 13:40:50,119][root][INFO] - Processed 399/7200 datapoints.
[2022-09-25 13:41:10,322][root][INFO] - Processed 799/7200 datapoints.
[2022-09-25 13:41:30,249][root][INFO] - Processed 1199/7200 datapoints.
[2022-09-25 13:41:31,050][root][INFO] - Hit max num evaluations 100.
[2022-09-25 13:41:31,051][root][INFO] - Wrote data to: results/data_ranking_task_0_nprompts_6_npromptvars_2022-09-25 13:40:31.034449.json
[2022-09-25 13:41:31,052][root][INFO] - Wrote results to: results/results_ranking_task_0_nprompts_6_npromptvars_1_models_2022-09-25 13:40:31.034494.json
```
</details>

Note that this command performs more than 100 evaluations, because each example is wrapped in 6 templates, and both a coherent and an incoherent version of it need to be passed through the language model.
To run on the full dataset, remove the `max_num_evaluations` flag from the command.
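
The counts in the log can be reproduced with a quick calculation. This is only a sketch of the arithmetic: it assumes the 600-example test set implied by the `7200` total in the log, and `num_datapoints` is a hypothetical helper.

```python
NUM_TEMPLATES = 6            # prompt templates per example
VERSIONS_PER_EXAMPLE = 2     # coherent and incoherent


def num_datapoints(num_examples: int) -> int:
    """Number of texts passed through the model for `num_examples` examples."""
    return num_examples * NUM_TEMPLATES * VERSIONS_PER_EXAMPLE


capped = num_datapoints(100)   # with ++max_num_evaluations=100
full = num_datapoints(600)     # assumed full test set
print(capped, full)
```

This gives 1200 datapoints for the capped run, consistent with the run stopping shortly after `Processed 1199/7200 datapoints`, and 7200 for the full test set.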

### API models: OpenAI and Cohere 

If you have an OpenAI or Cohere key, place the former at the top of `src/models.py`:
```python
openai.organization = "<INSERT PERSONAL KEY>"
openai.api_key = "<INSERT PERSONAL KEY>"
```
Place the latter in `static/cohere_api_key.txt`. Then the same command as above can be used
to run evals with OpenAI and Cohere models by replacing the value of `model_ids` with the right identifier.

NB: running the following commands with a proper API key may cost money! For example, for OpenAI:

```bash
python -m src.probe_llm +experiment=particularised ++model_ids=openai-davinci ++objectives=lm
```

and for Cohere:

```bash
python -m src.probe_llm +experiment=particularised ++model_ids=cohere-xl ++objectives=lm
```

### Bigger open source models

These evaluations are run with [EleutherAI's eval harness](https://github.com/EleutherAI/lm-evaluation-harness). To use this framework,
the dataset needs to be available on Hugging Face. To preserve anonymity, we cannot release
the name under which we published the dataset on Hugging Face, and thus for now only provide
the command to run without the dataset identifier. The identifier will be added after the anonymity period is over.

EleutherAI's eval harness makes it easy to run evaluations on large models that need to be loaded across multiple GPUs, for example with the following command:

```bash
python main.py --model_api_name 'hf-causal' --model_args pretrained=facebook/opt-2.7b --task_name {task_name}/${k}-shot  --template_names 'template_1,template_2,template_3,template_4,template_5,template_6' --device gpu
```

