# Transformer Pooling Experiments

Fine-tune pretrained LLMs and embedding models on downstream tasks with different pooling strategies.

- Hydra for config management.
- HuggingFace libraries for models, with a PyTorch backend.
- Pre-training code and scripts for GPT-2 models.
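
As a rough illustration of what "pooling" means here, the sketch below reduces a sequence of per-token hidden states to a single vector in three common ways. It is written in plain Python so it runs without dependencies; the function names and the choice of strategies are assumptions for illustration and do not necessarily match the pooling options implemented in this repository.

```python
from typing import List, Sequence

Vector = List[float]


def _unmasked(hidden: Sequence[Vector], mask: Sequence[int]) -> List[Vector]:
    """Keep only the hidden states whose attention-mask entry is non-zero."""
    kept = [h for h, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    return kept


def mean_pool(hidden: Sequence[Vector], mask: Sequence[int]) -> Vector:
    """Average the hidden states of all non-padding tokens."""
    kept = _unmasked(hidden, mask)
    dim = len(kept[0])
    return [sum(h[j] for h in kept) / len(kept) for j in range(dim)]


def max_pool(hidden: Sequence[Vector], mask: Sequence[int]) -> Vector:
    """Take the element-wise maximum over all non-padding tokens."""
    kept = _unmasked(hidden, mask)
    dim = len(kept[0])
    return [max(h[j] for h in kept) for j in range(dim)]


def last_token_pool(hidden: Sequence[Vector], mask: Sequence[int]) -> Vector:
    """Use the hidden state of the last non-padding token."""
    positions = [i for i, m in enumerate(mask) if m]
    if not positions:
        raise ValueError("attention mask selects no tokens")
    return list(hidden[positions[-1]])


# Three tokens with 2-dimensional hidden states; the third token is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

print(mean_pool(hidden_states, attention_mask))
print(max_pool(hidden_states, attention_mask))
print(last_token_pool(hidden_states, attention_mask))
```

Last-token pooling is a frequent choice for causal models such as GPT-2, since only the final position has attended to the whole sequence, while mean pooling is common for bidirectional encoders.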

## Run

The entry point is the `run_llm.py` script. To run a specific model/data/pooling experiment, launch it with the following parameters:

`torchrun --nproc_per_node=[NUM_GPUs] run_llm.py +datasets=[DATASETS] backbone=[BACKBONE] pooling=[POOLING] zero_shot_eval=false runs=5`

All configuration parameters are in `configs/general.yaml`, with dataset-specific options in the `/datasets` subfolder.
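
To compare several pooling options, the launch command can be generated in a loop. The sketch below only builds and prints the command lines in the format shown above; the dataset, backbone, and pooling names (`my_dataset`, `my_backbone`, `mean`, `last_token`) are hypothetical placeholders, so substitute values that exist in your configs.

```python
import shlex
from typing import List


def build_command(num_gpus: int, datasets: str, backbone: str,
                  pooling: str, runs: int = 5) -> List[str]:
    """Assemble the torchrun invocation with Hydra-style overrides."""
    return [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        "run_llm.py",
        f"+datasets={datasets}",
        f"backbone={backbone}",
        f"pooling={pooling}",
        "zero_shot_eval=false",
        f"runs={runs}",
    ]


# Placeholder names: replace with options defined in your configuration files.
for pooling in ("mean", "last_token"):
    command = build_command(2, "my_dataset", "my_backbone", pooling)
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or the list returned by `build_command` can be passed to `subprocess.run` to launch the experiments one after another.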

### GPT-2 Pretraining

Our experiments used GPT-2 checkpoints pretrained on the OpenWebText corpus. The training code and model definition are under `src/nanoGPT`. The `src/nanoGPT/data/openwebtext/prepare.py` script downloads and prepares the corpus, and the `src/nanoGPT/train.py` script runs the pretraining and saves the checkpoints. The configuration file is `src/nanoGPT/config/config.yaml`. The default settings were used, except for the `l2_mha` flag, which controls whether the model uses standard scaled dot-product self-attention or an L2 attention kernel. Our experiments used pretrained models for both settings.
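
To make the difference between the two attention settings concrete, the sketch below computes attention weights for a single query under both kernels. It assumes the L2 kernel scores each key by the negative squared Euclidean distance to the query, scaled by the square root of the head dimension; the exact formulation behind `l2_mha` in `src/nanoGPT` may differ (for example in scaling or weight tying), so treat this as an illustration rather than the repository's implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_scores(query: Vector, keys: Sequence[Vector]) -> List[float]:
    """Standard scaled dot-product scores: (q . k) / sqrt(d)."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]


def l2_scores(query: Vector, keys: Sequence[Vector]) -> List[float]:
    """Assumed L2 kernel scores: -||q - k||^2 / sqrt(d)."""
    scale = math.sqrt(len(query))
    return [-sum((q - k) ** 2 for q, k in zip(query, key)) / scale for key in keys]


query = [1.0, 0.0]
keys = [[1.0, 0.0],   # identical to the query
        [3.0, 0.0]]   # same direction, larger norm

dot_weights = softmax(dot_product_scores(query, keys))
l2_weights = softmax(l2_scores(query, keys))

print("dot-product weights:", dot_weights)
print("L2 weights:", l2_weights)
```

In this toy case the dot-product kernel favors the second key because of its larger norm, whereas the L2 kernel favors the first key because it is closest to the query.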