# Goal-Driven Discovery of Distributional Differences via Language Description 

This repo contains the implementation of our D5 system. However, since the 44GB model weights are difficult to transfer, we replace the verifier with a DummyVerifier that returns random values. 

## Setup

We highly recommend running this in a conda environment.

Run 

```
pip install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
python3 -m nltk.downloader punkt stopwords averaged_perceptron_tagger
```

If you want to run ```lm_proposer.py```, you need to set the ```openai_key``` environment variable.
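
As a minimal sketch, the check below fails early with a readable message when the variable is missing. The helper name ```require_openai_key``` is hypothetical and is not part of this repo; how ```lm_proposer.py``` actually reads the variable may differ.

```python
import os


def require_openai_key(env=None):
    """Return the value of the openai_key variable, or raise a clear error.

    Hypothetical helper for illustration; not part of the D5 code base.
    """
    env = os.environ if env is None else env
    key = env.get("openai_key", "").strip()
    if not key:
        raise RuntimeError(
            "Set the openai_key environment variable before running lm_proposer.py"
        )
    return key
```

Calling ```require_openai_key()``` with no argument reads from the real process environment.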

## Problem Representation

```example_problem.pkl``` contains an example problem represented as a dictionary. Load it with the following line (with ```pickle``` imported as ```pkl```):

```problem = pkl.load(open('example_problem.pkl', 'rb'))```

Each problem contains: 
- A corpus pair. Samples from the research split of Corpus A are in ```problem['split']['research']['A_samples']```. Corpus B and the validation split follow the same layout.
- Research goal and how the data was collected: all the other fields. Refer to Figure 1 in the paper and the proposer prompt template in ```templates/gpt3_proposer.txt``` to interpret the values.
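
The layout described above can be explored with a short sketch like the one below. It builds a toy problem rather than reading ```example_problem.pkl```, so it runs anywhere. Only the path ```['split']['research']['A_samples']``` comes from this README; the ```B_samples``` key, the sample texts, and the helper ```summarize_split``` are assumptions made for illustration.

```python
import os
import pickle as pkl
import tempfile

# Toy stand-in for example_problem.pkl. The "B_samples" key name is an
# assumption that mirrors "A_samples"; the real file also has other fields.
toy_problem = {
    "split": {
        "research": {
            "A_samples": ["Covid cases rise again", "New Covid vaccine approved"],
            "B_samples": ["Stock markets rally"],
        },
    },
}


def summarize_split(problem, split_name="research"):
    """Count the samples stored under each key of one split."""
    split = problem["split"][split_name]
    return {key: len(samples) for key, samples in split.items()}


# Round-trip through pickle, the same way a real problem file would be loaded.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_problem.pkl")
    with open(path, "wb") as f:
        pkl.dump(toy_problem, f)
    with open(path, "rb") as f:
        problem = pkl.load(f)

summary = summarize_split(problem)
print(summary)
```

On the real file, printing ```problem.keys()``` is a quick way to see the remaining fields before reading the prompt template.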


## D5 System

(Each of the Python files mentioned below is individually runnable to illustrate its functionality.)

Run ```python3 global_run.py``` to see how to run our system; it describes the differences between news from 2020 and news from 2019. 
We chose this example to illustrate our code because it has a simple expected discovery: "talks more about Covid". 

The overall logic is implemented in ```D5.py```. The D5 class takes (ordered) lists of samples from Corpus A and Corpus B, a proposer that maps two groups of sentences to a list of hypotheses, and a verifier that computes *T'(h, x)*.
The D5 class also offers other functionality, such as stopping the computation of V values early for hypotheses that are not promising.
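
The propose-then-verify flow can be sketched roughly as follows. This is not the interface of the D5 class: ```toy_proposer```, ```keyword_verifier```, and ```score_hypotheses``` are hypothetical names, the verifier is a keyword match rather than a language model, and the sketch assumes V is estimated as the mean *T'(h, x)* on Corpus A minus the mean on Corpus B. Early stopping is left out.

```python
from statistics import mean


def toy_proposer(a_samples, b_samples):
    """Hypothetical proposer: fixed hypotheses instead of prompting GPT-3."""
    return ["talks more about Covid", "mentions sports"]


def keyword_verifier(hypothesis, sample):
    """Hypothetical stand-in for T'(h, x): 1.0 if the last word of the
    hypothesis appears in the sample, else 0.0."""
    keyword = hypothesis.split()[-1].lower()
    return 1.0 if keyword in sample.lower() else 0.0


def score_hypotheses(a_samples, b_samples, proposer, verifier):
    """Map each proposed hypothesis to an estimated V value
    (assumed here: mean score on A minus mean score on B)."""
    results = {}
    for h in proposer(a_samples, b_samples):
        a_scores = [verifier(h, x) for x in a_samples]
        b_scores = [verifier(h, x) for x in b_samples]
        results[h] = mean(a_scores) - mean(b_scores)
    return results


corpus_a = ["Covid cases rise again", "New Covid vaccine approved",
            "Local team wins sports final"]
corpus_b = ["Stock markets rally", "Sports season opens",
            "Election results announced"]

v_values = score_hypotheses(corpus_a, corpus_b, toy_proposer, keyword_verifier)
best = max(v_values, key=v_values.get)
print(best, v_values[best])
```

Passing the proposer and verifier as plain callables mirrors the description above, where both components are handed to the D5 class and can be swapped independently.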

```lm_proposer.py``` includes the implementation for prompting GPT-3 to propose hypotheses. 

```verifier.py``` includes the implementation for computing the V value for each hypothesis. Note that ```global_run.py``` currently uses a **dummy verifier** that returns random results. 

```get_representative.py``` sorts the samples by how "representative" they are of the clusters they came from.

```h2h_dicts.pkl``` is an example output of our D5 system. It maps each hypothesis *h* to a collection of information, which includes:
- ```sample2score```: a mapping from samples *x* to the *T'(h, x)* score
- ```provenance```: how the hypothesis was generated, e.g., what GPT-3 prompt and hyperparameter led to this hypothesis
- ```diff_w_significance```: the *V*-value, along with its confidence interval and the p-value for the null hypothesis *V' = 0*
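
As a rough illustration of how such an output could be inspected, the sketch below ranks samples by their *T'(h, x)* score for one hypothesis. It uses a toy mapping rather than ```h2h_dicts.pkl```, assumes the per-hypothesis information can be indexed by the field names listed above, and the helper ```top_samples``` and all scores are made up for illustration.

```python
def top_samples(h2h_dicts, hypothesis, k=2):
    """Return the k samples with the highest score for a hypothesis,
    as (sample, score) pairs in descending order of score."""
    sample2score = h2h_dicts[hypothesis]["sample2score"]
    ranked = sorted(sample2score.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy mapping with invented scores; the real file also stores
# diff_w_significance, whose internal layout is not assumed here.
toy_h2h_dicts = {
    "talks more about Covid": {
        "sample2score": {
            "Covid cases rise again": 0.93,
            "Stock markets rally": 0.04,
            "New vaccine approved": 0.71,
        },
        "provenance": "toy example, not a real GPT-3 prompt",
    },
}

for sample, score in top_samples(toy_h2h_dicts, "talks more about Covid"):
    print(f"{score:.2f}  {sample}")
```

Reading the highest-scoring samples next to a hypothesis is a quick sanity check that the verifier scores match what the hypothesis says.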

## Section 5: Quantitative Evaluation

### Results on SynD5
Change directory to ```neurips2023_synd5``` to reproduce the results on SynD5. You can find the instructions in the README.md there. 

### Relevance evaluation on OpenD5
Our relevance evaluation of the hypotheses is in ```sec5_opend5/meaningfulness_eval.csv```.
Run 

```python3 sec5_opend5/meaningfulness_eval_statistics.py``` 

to reproduce the evaluation we conducted in the paper.

```sec5_opend5/goals.json``` contains the full list of research goals for each problem of OpenD5 when this evaluation was run. 

## Discoveries in Section 6.1

```sec6_discoveries/discoveries.json``` contains all the discoveries produced in Section 6.1, along with their V' values and p'-values. 

Run ```python3 sec6_discoveries/read.py``` to read an example discovery.
