Our code repository is based on the public Hugging Face transformers library. Unfortunately, we currently cannot
provide the fully runnable code, as we are unable to share all of the necessary assets (datasets and checkpoints)
in an anonymized fashion. However, we fully intend to publicly release this code, including all assets, with full
reproducibility.

Code Sections
For easy readability, we include short pseudocode describing how the masks for our algorithm
are computed in the adjacent masking_pseudo_code.py file.
The actual code for this process can be found in "full_code/src/transformers/models/perceiver_dap/transforms_dap.py"
under the RandomlySelectedCrossAttentionMasking class.
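To give a flavor of what such a mask computation involves, here is a minimal illustrative sketch. It assumes the transform produces a boolean cross-attention mask over input positions by hiding a randomly selected subset; the function name and signature are hypothetical, and the actual algorithm is the RandomlySelectedCrossAttentionMasking class described above.

```python
import random

def random_cross_attention_mask(num_inputs, num_latents, mask_ratio=0.15, seed=None):
    """Hypothetical sketch of random cross-attention masking.

    Selects a random subset of input positions to hide, then builds a
    (num_latents x num_inputs) boolean mask where True means the latent
    may attend to that input position.
    """
    rng = random.Random(seed)
    num_masked = int(mask_ratio * num_inputs)
    # Sample the positions to mask without replacement.
    masked = set(rng.sample(range(num_inputs), num_masked))
    # Every latent row hides the same masked positions.
    mask = [[j not in masked for j in range(num_inputs)]
            for _ in range(num_latents)]
    return mask, sorted(masked)
```

In this simplified version all latents share one mask; the actual class may vary the selection per latent or per cross-attention layer.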

The main code for our modeling and algorithm infrastructure lives in
"full_code/src/transformers/models/perceiver_dap" (DAP standing for "domain-agnostic pretraining").
The two main files here are "modeling_dap.py" and "transforms_dap.py". In "modeling_dap.py" we implement the main model architecture
as well as a system for implementing and training self-supervised models. We designed the system to be easily
extensible to new methods and training algorithms while providing enough shared structure to support common
operations such as reconstruction decoders and support for transforms. These transforms are located
in "transforms_dap.py", including our main proposed transform, "RandomlySelectedCrossAttentionMasking", which implements the
mask generation algorithm described in our main paper. In addition, we implement other masking baselines.
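The shared-structure design described above might be sketched as follows. This is a hypothetical simplification, not the actual interface from "transforms_dap.py": it assumes a base transform class that owns the common behaviour (producing inputs, a mask, and reconstruction targets), with subclasses supplying only the mask-generation strategy.

```python
import random

class MaskingTransform:
    """Hypothetical base class: shared structure for masking transforms.

    Subclasses implement generate_mask(); the base class handles the
    common work of packaging inputs, mask, and reconstruction targets.
    """
    def __call__(self, tokens):
        mask = self.generate_mask(tokens)
        # Keep an untouched copy as the reconstruction target.
        return {"inputs": tokens, "mask": mask, "labels": list(tokens)}

    def generate_mask(self, tokens):
        raise NotImplementedError

class RandomTokenMasking(MaskingTransform):
    """Example baseline: mask each position independently."""
    def __init__(self, mask_ratio=0.15, seed=None):
        self.mask_ratio = mask_ratio
        self.rng = random.Random(seed)

    def generate_mask(self, tokens):
        # True = this position is masked out.
        return [self.rng.random() < self.mask_ratio for _ in tokens]
```

Under this layout, adding a new masking baseline only requires subclassing and overriding generate_mask(), which matches the extensibility goal stated above.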

The code to run each individual experiment is located in "full_code/examples/research_projects/domain-agnostic-pretraining."
"discrete_pretraining" contains the script to run text, protein, and chemistry SMILES sequence pretraining experiments.
"bio_finetuning" contains the code to run downstream tasks on the TAPE benchmark.
"text-classification" contains the code to assess language models on the GLUE benchmark.
"chem-finetuning" contains the script to run downstream tasks on the MoleculeNet classification and regression tasks.
"higgs-pretraining" contrains the code used to pretrain the particle physics models.
"higgs_classification" contains the code to run the HIGGS tabular classification tasks.
"image_pretraining" contains the script to run image pretraining experiments.
"image_classification" contains the code to run the ImageNet100 downstream classification benchmark.
"saved_models" contains the configurations used for different models and architectures.

Thank you for taking the time to examine our code. We hope it clarifies any remaining questions about our experimental setup!