# Reproducing NLG evaluation experiments

### Get processed version of the data

Run ``preprocessing_<dataset>.py``

### Check Depth score

Depth score and the other baselines can be found in ``depth_score.py``.

### Benchmark a metric on a specific dataset 

Run ``main_<dataset>.py``.

An example using a Slurm queue can be found in ``caviar.sh``.
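The two steps above (preprocessing, then benchmarking) can be chained for one dataset. The following is a minimal sketch, not part of the repository: the helper names `pipeline_commands` and `run_pipeline` are hypothetical, and it assumes the scripts sit in the repository root and take no command-line arguments.

```python
import subprocess
import sys
from pathlib import Path


def pipeline_commands(dataset, repo_root="."):
    """Build the two commands for one dataset: preprocessing, then benchmark."""
    root = Path(repo_root)
    return [
        # Assumption: scripts follow the preprocessing_<dataset>.py / main_<dataset>.py pattern.
        [sys.executable, str(root / f"preprocessing_{dataset}.py")],
        [sys.executable, str(root / f"main_{dataset}.py")],
    ]


def run_pipeline(dataset, repo_root=".", dry_run=True):
    """Run both steps in order; with dry_run=True, only print the commands."""
    commands = pipeline_commands(dataset, repo_root)
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the pipeline if a step fails.
            subprocess.run(command, check=True)
    return commands
```

Calling `run_pipeline("<dataset>")` prints the commands without executing them; pass `dry_run=False` to actually run them. Replace `<dataset>` with the suffix used by the script names in the repository.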

### Download the data from source

- The CNN dataset can be found here: https://github.com/neulab/REALSumm
- The WebNLG data can be found here: https://webnlg-challenge.loria.fr/workshop_2020/

#### Not reported in the paper (space constraints) but supported in the framework
N.B.: DepthScore also performs really well compared with other metrics on many further benchmarks, including:

- TAC: has to be bought (not free)
- Quora duplicate questions: https://www.kaggle.com/c/quora-question-pairs
- Quora - multiling: https://github.com/google-research-datasets/paws
- MSR dataset: https://www.microsoft.com/en-us/download/details.aspx?id=54262
- hotel / bagel: https://github.com/UFAL-DSG/tgen
- WMT15 and WMT16: http://www.statmt.org/wmt15/results.html and http://www.statmt.org/wmt16/results.html, respectively
- Coco dataset: https://cocodataset.org/


### Dependencies

This project requires the following dependencies:
- Transformers: https://huggingface.co/transformers/
- SummEval: https://github.com/Yale-LILY/SummEval
- Pyter: https://pypi.org/project/pyter/

We thank the authors of these packages for their contributions to our research.
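Before launching the experiments, it can help to check that the dependencies are importable. The following is a minimal sketch: the helper `missing_dependencies` is hypothetical, and the import names `transformers`, `summ_eval`, and `pyter` are assumptions about how these packages are exposed once installed.

```python
import importlib.util

# Assumption: display name -> importable module name for each dependency.
ASSUMED_MODULES = {
    "Transformers": "transformers",
    "SummEval": "summ_eval",
    "Pyter": "pyter",
}


def missing_dependencies(modules=ASSUMED_MODULES):
    """Return the display names of the modules that cannot be found."""
    return [
        name
        for name, module in modules.items()
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None
    ]
```

An empty list from `missing_dependencies()` means all three modules were found in the current environment.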

#### Infrastructure:

All experiments can be run on a single NVIDIA V100 (+ some GPUs), but running all of them may take some time (under 2 hours).