# Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

This anonymous repository provides the implementation of ICLR submission #8000: Bring Your Own Data! Self-Supervised Evaluation for Large Language Models.



## Dependencies

* transformers==4.28.1
* scipy==1.10.1
* torch==2.0.0
* datasets==2.11.0
* nltk==3.8.1
* apache_beam==2.48.0

Python 3.8 or higher is recommended.

## Usage

See `run_model.sh` for examples of how to evaluate a model. As an example, we provide scripts named `run_[metric].py` that evaluate Hugging Face models on metrics computed from Wikipedia data.

Note that only Hugging Face models are currently supported.
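As a rough sketch of what loading a Hugging Face model might look like, the helper below uses the standard `transformers` auto classes. The helper name `load_model_and_tokenizer` and the default checkpoint `gpt2` are illustrative assumptions, not part of this repository, and the provided `run_[metric].py` scripts may load models differently.

```python
def load_model_and_tokenizer(model_name: str = "gpt2"):
    """Hypothetical helper: load a causal LM and its tokenizer from the Hugging Face Hub.

    "gpt2" is only a placeholder default; substitute the checkpoint you want to evaluate.
    """
    # Imported inside the function so that defining the helper does not
    # require transformers to be importable.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()  # switch to inference mode (disables dropout)
    return model, tokenizer
```

Calling `model, tokenizer = load_model_and_tokenizer("gpt2")` downloads the checkpoint on first use, so it needs network access.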


You can also use the metrics directly, given your own `model`, `tokenizer`, and dataset (passed as `data` below):
```
import BYOD

long_range_sensitivity = BYOD.lrs_metric(model, data, tokenizer)
negation_knowledge = BYOD.negation_metric(model, data, tokenizer)
tokenization_robustness = BYOD.tokenization_metric(model, data, tokenizer)
toxicity_proxy = BYOD.toxicity_metric(model, data, tokenizer)
word_order_sensitivity = BYOD.word_order_metric(model, data, tokenizer)
```
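If you want to compute several metrics in one pass, a small wrapper can collect the results by name. This is a minimal sketch that assumes only the `(model, data, tokenizer)` call signature shown above; `run_all_metrics` is a hypothetical helper, not part of the `BYOD` package, and it makes no assumption about what each metric returns.

```python
from typing import Any, Callable, Dict


def run_all_metrics(
    metrics: Dict[str, Callable[[Any, Any, Any], Any]],
    model: Any,
    data: Any,
    tokenizer: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: call each metric as metric(model, data, tokenizer)."""
    results = {}
    for name, metric_fn in metrics.items():
        # Each metric receives the same model, data, and tokenizer.
        results[name] = metric_fn(model, data, tokenizer)
    return results
```

For example, passing `{"long_range_sensitivity": BYOD.lrs_metric, "negation_knowledge": BYOD.negation_metric}` as `metrics` would return a dictionary with one entry per metric name.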

