# Reproducibility Instructions

This repository contains code and resources corresponding to the three research questions (RQ1–RQ3) in our paper.

---

## RQ#1: Pretraining Data

- Download the variant frequencies from this link:  
  [Google Drive – Variant Frequencies](https://drive.google.com/file/d/1ZvPFcnTTXbQHDNMmkS86AiGD7GuYdN9E/view?usp=sharing)  
- Extract the files and place them in the `Pretraining Data` directory.  
- Run the code in that directory to compute the variant probabilities and generate the results and plots.  
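
As a rough illustration of the probability step, the sketch below turns per-pair counts into the share of American English usage. It is not the repository's code: the function name `variant_probabilities`, the input layout (a mapping from a variant pair to its two counts), and the example counts are all assumptions made up for this example, and the downloaded files may be structured differently.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]      # (AmE spelling, BrE spelling)
Counts = Tuple[int, int]    # (AmE count, BrE count)


def variant_probabilities(counts: Dict[Pair, Counts]) -> Dict[Pair, float]:
    """Return, for each pair, the relative frequency of the AmE variant.

    Hypothetical helper: assumes each pair comes with raw counts for
    both variants. Pairs that never occur are skipped to avoid 0/0.
    """
    probabilities = {}
    for pair, (ame_count, bre_count) in counts.items():
        total = ame_count + bre_count
        if total == 0:
            continue  # no evidence for either variant
        probabilities[pair] = ame_count / total
    return probabilities


# Made-up counts, for illustration only (not taken from the paper's data).
example_counts = {
    ("color", "colour"): (300, 100),
    ("center", "centre"): (0, 0),
}
print(variant_probabilities(example_counts))
```

The BrE share for a pair is one minus the returned value; pairs with no occurrences are left out rather than assigned an arbitrary probability.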

---

## RQ#2: Tokenization

- The code for each model is provided as a Jupyter notebook.  
- Run the notebooks to reproduce the results and plots for this section.  
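
If you prefer to execute the notebooks without opening them, one option is to call `jupyter nbconvert` from a small script. The sketch below is a convenience under stated assumptions, not part of the repository: the helper names `build_command` and `run_notebooks` are hypothetical, the directory `.` is a placeholder for wherever the notebooks live, and `jupyter` with `nbconvert` must be installed.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(notebook: Path) -> List[str]:
    """Build the nbconvert command that executes a notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(notebook),
    ]


def run_notebooks(directory: Path, dry_run: bool = True) -> List[List[str]]:
    """Collect (and optionally run) one command per notebook in `directory`.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    commands = [build_command(nb) for nb in sorted(directory.glob("*.ipynb"))]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if a notebook fails.
            subprocess.run(command, check=True)
    return commands


# Placeholder directory: replace "." with the notebook folder.
for command in run_notebooks(Path("."), dry_run=True):
    print(" ".join(command))
```

Calling `run_notebooks(path, dry_run=False)` executes each notebook and overwrites it with its outputs, so keep a clean copy if you want to compare results.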

---

## RQ#3: Generated Text Alignment

- The implementation of **DiAlign** is provided in `diAlign_score.py`.  
- Validation data and the set of questions across different registers are included in this directory.  

---

## Resources

- The curated AmE–BrE variant pairs are available in the `Resources` directory.  

---

## Notes

- The classifier code for variant grouping is located in the `utils` directory.  
- The preprocessing code used to prepare the pretraining datasets is located in the same `utils` directory.  
