
# RAINPROOF


Our experiments run in three steps.
- First, we generate the reference-set data. For the Mahalanobis distance, we estimate the mean and the covariance matrix of the reference set; for our information projections, we only need to store the bags of distributions.

- Then, we run the experiments for each reference/shift pair.

- Finally, we summarize the results in the Global tables and figures Jupyter notebook.
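To make the Mahalanobis part of the first step concrete, here is a minimal NumPy sketch of fitting the reference statistics and scoring a new sample. It is an illustration only, not the code in `scripts/mk_mahalanobis.py`: the function names, the choice of one feature vector per sentence, and the use of a pseudo-inverse are all assumptions.

```python
import numpy as np


def fit_reference(features):
    """Estimate the mean and inverse covariance of a reference set.

    features: array of shape (n_samples, dim). Using one feature vector
    per sentence is an assumption made for this sketch.
    """
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    # The pseudo-inverse tolerates a singular covariance matrix,
    # which can happen when the reference set is small.
    precision = np.linalg.pinv(cov)
    return mean, precision


def mahalanobis_score(x, mean, precision):
    """Mahalanobis distance of x to the reference distribution."""
    diff = x - mean
    return float(np.sqrt(max(0.0, diff @ precision @ diff)))


# Synthetic data standing in for real model features.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
mean, precision = fit_reference(reference)

in_dist = mahalanobis_score(reference[0], mean, precision)
shifted = mahalanobis_score(reference[0] + 10.0, mean, precision)
print(f"in-distribution: {in_dist:.2f}, shifted: {shifted:.2f}")
```

A sample far from the reference set gets a larger score, which is what makes the distance usable as an out-of-distribution signal.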

## Building the reference sets

See `run_references.sh` for all the pairs.

```
for size in 10 100 250 500 750 1000 1500 2000; do
	python scripts/mk_mahalanobis.py \
		--dataset_name "Helsinki-NLP/tatoeba_mt" \
		--dataset_config "deu-eng" \
		--dataset_split test \
		--model_name "Helsinki-NLP/opus-mt-de-en" \
		--output_dir data/mahalanobis/de-tatoeba/ \
		--size $size
done

python scripts/mk_distribs_signatures.py \
	--dataset_name "Helsinki-NLP/tatoeba_mt" \
	--dataset_config "deu-eng" \
	--dataset_split test \
	--model_name "Helsinki-NLP/opus-mt-de-en" \
	--size 10000 \
	--output_dir "data/iprojrefs"
```
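The second command stores a bag of reference distributions. One way such a bag can be used at test time is sketched below: the score of a new distribution is its smallest divergence to any distribution in the bag. This README does not say which divergence the scripts use, so the KL divergence here is a stand-in, and all names are hypothetical.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions (stand-in divergence)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))


def projection_score(p, reference_bag):
    """Smallest divergence between p and any reference distribution."""
    return min(kl_divergence(p, q) for q in reference_bag)


# Toy bag of reference distributions over a 3-token vocabulary.
reference_bag = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
])

close = np.array([0.65, 0.25, 0.10])  # resembles the references
far = np.array([0.05, 0.05, 0.90])    # very different from the references

score_close = projection_score(close, reference_bag)
score_far = projection_score(far, reference_bag)
print(f"close: {score_close:.4f}, far: {score_far:.4f}")
```

Because the score only needs the stored distributions, no model fitting is required for this detector beyond building the bag.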

## Experiments

Usage of `scripts/textood.py`:
```
python scripts/textood.py --help
usage: Description [-h] [--dataset_name DATASET_NAME] [--dataset_config DATASET_CONFIG] [--dataset_split DATASET_SPLIT] [--compute_dist_mahalanobis COMPUTE_DIST_MAHALANOBIS] [--compute_dist_set COMPUTE_DIST_SET]
                   [--model_name MODEL_NAME] [--output_dir OUTPUT_DIR] [--shuffle_input] [--switch_lang] [--size SIZE] [--num_beams NUM_BEAMS]

options:
  -h, --help            show this help message and exit
  --dataset_name DATASET_NAME
                        Dataset name
  --dataset_config DATASET_CONFIG
                        Dataset config
  --dataset_split DATASET_SPLIT
                        Train, validation or test
  --compute_dist_mahalanobis COMPUTE_DIST_MAHALANOBIS
                        Path to mahalanobis reference file.
  --compute_dist_set COMPUTE_DIST_SET
                        Path to mahalanobis reference file.
  --model_name MODEL_NAME
                        Huggingface model name
  --output_dir OUTPUT_DIR
                        Where to store the results.
  --shuffle_input       Size of the beam search. Has to be > 1 since the point of this is to select the best hyps among those.
  --switch_lang         Wether to swap target lang and source lang
  --size SIZE           Maximum size of the dataset to use.
  --num_beams NUM_BEAMS
                        Size of the beam search. Has to be > 1 since the point of this is to select the best hyps among those.
```

Note that in the help text above, the descriptions of `--compute_dist_set` and `--shuffle_input` duplicate those of `--compute_dist_mahalanobis` and `--num_beams`. In the example below, `--compute_dist_set` receives the distribution-signature file written to `data/iprojrefs`; `--shuffle_input` is a flag that takes no value.


Example:
```
python scripts/textood.py \
	--num_beams 4 \
	--dataset_name "Helsinki-NLP/tatoeba_mt" \
	--model_name "Helsinki-NLP/opus-mt-nl-en" \
	--dataset_config "eng-nld" \
	--dataset_split test \
	--compute_dist_mahalanobis data/mahalanobis/nld-tatoeba \
	--compute_dist_set data/iprojrefs/distribs_signature-Helsinki-NLP-opus-mt-nl-en-nld-eng-Helsinki-NLP-tatoeba_mt-test.dat \
	--output_dir data/lshift/ \
	--switch_lang \
	--size 3000
```
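Running the experiments for every reference/shift pair means repeating the command above with different arguments. The sketch below builds those command lines in Python using only the flags listed in the help output. The helper name and the paths in `pairs` are hypothetical placeholders, not files shipped with the repository.

```python
import shlex


def textood_command(model_name, dataset_config, mahalanobis_ref, distribs_ref,
                    output_dir="data/lshift/", size=3000, num_beams=4,
                    switch_lang=False):
    """Build the argument list for one run of scripts/textood.py."""
    cmd = [
        "python", "scripts/textood.py",
        "--num_beams", str(num_beams),
        "--dataset_name", "Helsinki-NLP/tatoeba_mt",
        "--model_name", model_name,
        "--dataset_config", dataset_config,
        "--dataset_split", "test",
        "--compute_dist_mahalanobis", mahalanobis_ref,
        "--compute_dist_set", distribs_ref,
        "--output_dir", output_dir,
        "--size", str(size),
    ]
    if switch_lang:
        cmd.append("--switch_lang")
    return cmd


# Hypothetical (dataset config, Mahalanobis reference, signature file) triples.
pairs = [
    ("eng-nld", "data/mahalanobis/ref-a", "data/iprojrefs/ref-a.dat"),
    ("deu-eng", "data/mahalanobis/ref-b", "data/iprojrefs/ref-b.dat"),
]

commands = [
    textood_command("Helsinki-NLP/opus-mt-nl-en", config, maha, distribs,
                    switch_lang=True)
    for config, maha, distribs in pairs
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to launch the run; printing first makes it easy to check the arguments before spending compute.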

The data should be stored in `data/`.


