bigrams_sample/trigrams_sample.xlsx:


Each file contains a sample of 500 bigrams (respectively, trigrams) drawn uniformly from the bigrams (trigrams) that made it into our PMI_n list. Both samples are ranked by PMI_n and were annotated manually, with “NAME” marked next to named entities.


top_1000_uni_masked.xlsx:


Contains two lists of the 1000 whole-word tokens that were most often masked as unigrams, one for Random-Token Masking and one for PMI-Masking. The lists were computed over a 10% sample of the pretraining corpus.
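
As an illustration of how such a list could be tallied, here is a minimal sketch. It assumes the masking output is available as a list of masked spans, each span being a list of whole-word tokens; the function name `top_unigram_masked` and this input format are hypothetical and are not taken from our pipeline.

```python
from collections import Counter

def top_unigram_masked(masked_spans, k=1000):
    """Return the k tokens most often masked on their own (span length 1).

    masked_spans: iterable of spans, each a list of whole-word tokens
    (an assumed format, used here for illustration only).
    """
    # Only spans of length 1 count as "masked as a unigram".
    counts = Counter(span[0] for span in masked_spans if len(span) == 1)
    return counts.most_common(k)

# Toy input, not data from the corpus.
spans = [["the"], ["new", "york"], ["the"], ["cat"]]
print(top_unigram_masked(spans, k=2))
```

Running the same tally once per masking approach would give one list per approach, matching the two lists in the file.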


span_length_dist_random_token_masking/_random_span_masking/_pmi_masking.png:


Contains the percentage of masked spans at each span length, for each of the masking approaches.
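
The quantity plotted can be sketched as follows. This is a minimal example, assuming the lengths of all masked spans for one masking approach have been collected into a list; the function name `span_length_percentages` and the toy input are hypothetical.

```python
from collections import Counter

def span_length_percentages(span_lengths):
    """Map each span length to the percentage of masked spans with that length."""
    counts = Counter(span_lengths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: 100.0 * n / total for length, n in sorted(counts.items())}

# Toy span lengths, not data from our experiments.
example = [1, 1, 2, 3, 1, 2, 1, 1]
print(span_length_percentages(example))  # {1: 62.5, 2: 25.0, 3: 12.5}
```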


glue_diagnostics.xlsx:


Contains the scores of the PMI-Masked model and the Random-Span Masked model on the different linguistic evaluation tasks. Both models were trained for 2.4M steps over the 54G corpus; they are our main model and its baseline, as submitted to the GLUE benchmark system.