# Code soumission for our paper entitled: Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks 

## Hardware requirements:

All our experiements can be run on CPUs. To reproduce all the results of our paper we estimate the total runtime to 1k CPU hours. 
To reduce this requirement we provide notebooks already executed.

## Code overview:

Start by installing our requirements:
```pip install -r requirements.txt```

Our code is organized as follows:

1.  `partial_rankings.py`: Utility functions for calculating the borda rankings with missing values
2.   `task_level_calculate_correlations.py`: For calculating the rankings at the task level and conduct the robustness analysis
3.    `task_level_plot_correlations.py`: For plotting the results of the robustness analysis at the task level
4.    `instance_level_calculate_correlations.py`: For calculating the rankings at the instance level and conduct the robustness analysis
4.    `instance_level_plot_correlations.py`: For plotting the results of the robustness analysis at the instance level
5.    `confidence_intervals.ipynb` : For the confidence intervals analysis

## Upon Acceptance:

We commit to release the code and the collected datasets. We will further provide a command line tool to foster adoptability of our method.
