# NEWS-COPY
Code for our paper "Noise-Robust De-Duplication at Scale".

### Neural
- biencoder training 
- crossencoder training 
- evaluation of both of the above on our labelled dataset

### Rule-based
- n-gram overlap
- LSH
- evaluation of both of the above on our labelled dataset
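
For readers unfamiliar with the two rule-based baselines, here is a minimal, standard-library-only sketch of n-gram overlap (Jaccard similarity over word n-grams) and MinHash LSH with banding. It is an illustration, not the code in this repository: the function names, the choice of word 3-grams, 64 hash functions, 16 bands, and BLAKE2b as the hash are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def word_ngrams(text, n=3):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _hash(gram, seed):
    """64-bit hash of one n-gram, salted by `seed` (assumed hash family)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(" ".join(gram).encode("utf-8"),
                             digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(ngrams, num_perm=64):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return [min((_hash(g, seed) for g in ngrams), default=0)
            for seed in range(num_perm)]


def lsh_candidate_pairs(signatures, bands=16):
    """Return pairs of document ids whose signatures agree on at least one band.

    `signatures` maps document id -> MinHash signature (all the same length).
    """
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Candidate pairs returned by the LSH step would typically be checked with the exact Jaccard score before being treated as duplicates; more bands with fewer rows each make the filter more permissive.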


### Inference at scale
- biencoder inference over C4 
- LSH inference over C4
- biencoder and LSH over SuperGLUE 
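
As a rough illustration of what biencoder inference produces, the sketch below groups documents whose embeddings exceed a cosine-similarity threshold, using union-find to form connected components. It assumes the embeddings have already been computed by a trained biencoder; the function names, the 0.9 threshold, and the all-pairs comparison are assumptions for the example and not necessarily what this repository does. At C4 scale, the quadratic loop would have to be replaced by an approximate nearest-neighbour search.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cluster_by_threshold(embeddings, threshold=0.9):
    """Link pairs with cosine >= threshold and return the connected components.

    `embeddings` is a list of vectors; the result is a sorted list of
    index lists, one per group of near-duplicates.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, `cluster_by_threshold([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])` places the first two vectors in one group and leaves the third on its own.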

C4 is available for download courtesy of AllenAI; see https://github.com/allenai/allennlp/discussions/5056 for details.