## INTERPRETING LANGUAGE REWARD MODELS VIA CONTRASTIVE EXPLANATIONS

This supplemental material contains the utility code required to reproduce the results in the paper.

To run the code, fill in the parts of utils.py marked "anonymised": these are file paths and the configuration for making OpenAI API calls. Note that we loaded the open-source reward models and the datasets from local file paths.

Core dependencies include the OpenAI API, torch, transformers, datasets, sentence-transformers, spacy, and polyjuice.
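Before opening the notebooks, it can help to confirm that these packages are importable. The snippet below is a minimal sketch and not part of the supplied code: the import names in `CORE_MODULES` are assumptions (they can differ from the pip package names), and `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for the core dependencies listed above; adjust them
# if your installed versions expose different top-level modules.
CORE_MODULES = [
    "openai",
    "torch",
    "transformers",
    "datasets",
    "sentence_transformers",
    "spacy",
    "polyjuice",
]


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(CORE_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

The check only looks up each module without importing it, so it stays fast even for heavy packages such as torch.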

utils.py contains functions for processing the datasets, drawing randomly selected test sets, and running all explanation generation methods on one test set.
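To illustrate what drawing randomly selected test sets can look like, here is a minimal sketch using only the standard library. It is not the implementation in utils.py: the function name `sample_test_sets`, its parameters, and the choice to sample without replacement within each set are all assumptions made for illustration.

```python
import random


def sample_test_sets(n_examples, n_sets, set_size, seed=0):
    """Draw n_sets random test sets of dataset indices (hypothetical helper).

    Each set holds set_size distinct indices in range(n_examples). Different
    sets are drawn independently, so they may overlap with each other.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    return [
        sorted(rng.sample(range(n_examples), set_size))
        for _ in range(n_sets)
    ]


if __name__ == "__main__":
    for k, indices in enumerate(sample_test_sets(1000, n_sets=3, set_size=5)):
        print(f"test set {k}: {indices}")
```

Fixing the seed makes the same test sets available to every reward model, which keeps the explanations comparable across models.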

In /quantitative analysis, notebook 1 gives an example of generating explanations for one reward model (RM) and one dataset (over multiple test sets) using all methods. Repeat this for all three RMs and all three datasets before running notebook 2, which produces all the table results in Section 3 and the appendices.
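Since notebook 2 expects results for every RM and dataset combination, a small checklist can help track which of the nine runs are still outstanding. This is only a sketch: the RM and dataset labels are placeholders rather than the names used in the paper, and `pending_runs` is a hypothetical helper.

```python
from itertools import product

# Placeholder labels; substitute the identifiers you use for your own runs.
REWARD_MODELS = ["rm_1", "rm_2", "rm_3"]
DATASETS = ["dataset_1", "dataset_2", "dataset_3"]


def pending_runs(completed):
    """Return the (rm, dataset) pairs that notebook 1 has not yet covered."""
    return [
        pair for pair in product(REWARD_MODELS, DATASETS)
        if pair not in completed
    ]


if __name__ == "__main__":
    done = {("rm_1", "dataset_1")}
    for rm, dataset in pending_runs(done):
        print(f"still to run: {rm} on {dataset}")
```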

In /qualitative analysis, notebook 1 gives an example of generating explanations with our method for one dataset, where the test sets are filtered to keep only the test comparisons on which all three RMs agree in their preferences. Repeat this for all three datasets before running notebook 2, which performs the global sensitivity analysis and extracts representative examples, as described in Section 4.
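The agreement filter can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: it assumes each RM assigns a scalar reward to both responses in a comparison, that the preferred response is the one with the higher reward, and that ties are discarded. The names `preference` and `agreeing_indices` are hypothetical.

```python
def preference(score_a, score_b):
    """Return 1 if response A is preferred, -1 if B is, and 0 on a tie."""
    return (score_a > score_b) - (score_a < score_b)


def agreeing_indices(scores_by_rm):
    """Return indices of comparisons on which every RM prefers the same response.

    scores_by_rm maps an RM label to a list of (score_a, score_b) pairs,
    with all lists aligned so that index i is the same comparison.
    """
    per_rm = list(scores_by_rm.values())
    kept = []
    for i in range(len(per_rm[0])):
        prefs = {preference(*scores[i]) for scores in per_rm}
        # Keep the comparison only if there is a single, non-tied preference.
        if len(prefs) == 1 and 0 not in prefs:
            kept.append(i)
    return kept


if __name__ == "__main__":
    scores = {
        "rm_1": [(1.0, 0.0), (0.2, 0.9), (0.5, 0.1)],
        "rm_2": [(2.0, 1.0), (0.1, 0.3), (0.0, 0.4)],
        "rm_3": [(0.3, 0.1), (-1.0, 0.0), (0.9, 0.2)],
    }
    print(agreeing_indices(scores))
```

Comparing only the sign of the reward difference avoids any dependence on each RM's reward scale, which generally differs between models.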