# Exploring LLMs

If any of the instructions here are not working for you or you have to tweak something, please ping me (Till) and/or update the instructions to describe your solution.


## Setup

To clone the project, use `git clone --recurse-submodules -j8 <url> <project_name> && cd <project_name>`.

You can pull the latest updates (including from the submodule libraries) using `git pull --recurse-submodules`.

### Dependency Management

The project uses [Poetry](https://python-poetry.org/) to manage dependencies.
Poetry is more explicit than pip about how it tracks dependencies, and it creates virtual environments away from the actual project, which makes it easier to maintain projects under different environments.
To install Poetry, consult [this guide](https://python-poetry.org/docs/master/#installing-with-the-official-installer).

Alternatively, to just execute the code, you can skip the setup by using Docker as described below (assuming Docker is set up on your system).


## Usage

### With Docker (recommended)

Assuming Docker is set up correctly on your system, simply run `./scripts/run_docker.py` to build a container with all the dependencies and start it up into a shell.
You can also run `./scripts/run_docker.py <command>` to execute a command directly in the container, e.g. `./scripts/run_docker.py -i python src/main.py` to start a notebook server.
Note that the cluster provides rootless Docker, so these commands should work out of the box.

To restrict the set of GPUs that the container should use, you can append the `--gpus=id1,id2` option to `run_docker.py`.

The `src` directory is mounted, not copied, inside the container.
This way, changes made on the host are visible inside the container and vice versa.
You can add mount options in the `run_docker.py` script to mount additional directories inside the container.

### Without Docker

If you have set up the dependencies correctly, you can also execute the code directly on the machine.
To do that, first activate the virtual environment with `poetry shell`.
Then you can execute commands, e.g. `python src/main.py` to start a notebook server.

### Running experiments

To run a specific experiment, execute `[./scripts/run_docker.py] deepspeed src/main.py +sid=<seed_id> +<experiment_name>=<config_name>`, where `<experiment_name>` is a shortcode specifying which experiment you want to run, and `<config_name>` is the name of one of the configurations for the experiment.
For example, you can run the "Memorization Hyperparameter Relationship" (MHR) experiment in the test configuration via `./scripts/run_docker.py deepspeed src/main.py +sid=0 +mhr=test`.
`deepspeed` is used to enable training (large) LLMs.

You can override configuration options on the command line with `++`, i.e. by appending `++<flag>.<subflag>=<value>`.

You can also execute multiple configurations sequentially by adding the `--multirun` or `-m` flag.
For example, to run configurations `test1` and `test2` for the experiment above, you can use `./scripts/run_docker.py deepspeed src/main.py -m +sid=0 +mhr=test1,test2`.
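
The command-line pieces described above (seed, experiment shortcode, configurations, `++` overrides, multirun and the `--gpus` option) can be assembled programmatically. The following is a minimal sketch, not part of the project: the function `build_experiment_command` and its parameters are hypothetical, and it only builds the argument list without executing anything.

```python
from typing import Dict, List, Optional, Sequence


def build_experiment_command(
    experiment: str,
    configs: Sequence[str],
    seed_id: int = 0,
    overrides: Optional[Dict[str, object]] = None,
    use_docker: bool = True,
    gpus: Optional[Sequence[int]] = None,
) -> List[str]:
    """Assemble the argument list for one experiment invocation (hypothetical helper)."""
    if not configs:
        raise ValueError("at least one configuration name is required")

    command: List[str] = []
    if use_docker:
        command.append("./scripts/run_docker.py")
        if gpus:
            # Restrict the container to the given GPU ids.
            command.append("--gpus=" + ",".join(str(g) for g in gpus))
    command += ["deepspeed", "src/main.py"]

    if len(configs) > 1:
        # Several configurations are run sequentially via multirun.
        command.append("-m")
    command.append(f"+sid={seed_id}")
    command.append(f"+{experiment}={','.join(configs)}")

    # "++<flag>.<subflag>=<value>" overrides a configuration option.
    for key, value in (overrides or {}).items():
        command.append(f"++{key}={value}")
    return command


print(" ".join(build_experiment_command("mhr", ["test1", "test2"])))
# prints: ./scripts/run_docker.py deepspeed src/main.py -m +sid=0 +mhr=test1,test2
```

To actually launch the run, the returned list could be passed to `subprocess.run(command, check=True)` from the project root.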

### Accessing results and other output

The output of experiments (checkpoints, results) is (supposed to be) stored inside the `artifacts` directory.
Logs are written to the `logs` directory.


## Contributing to the library

The project uses [git pre-commit hooks](https://pre-commit.com/) to maintain code style and integrity, i.e. for code formatting, linting and unit-testing.
Before making any commits, please install `pre-commit` as described below.
The unit tests run inside a Docker container, so make sure you have Docker installed as well (see below).

When using Poetry, install the library dependencies (separately from the parent project) with `poetry install --with dev` and then run `poetry run pre-commit install` to set up the pre-commit hooks.
The process for pip is analogous (TODO: specify the exact pip instructions once/if you use it).

Now, the pre-commit hooks should automatically run before every commit.
To manually run the pre-commit hooks, execute `poetry run pre-commit run --all-files`.


## Step by step guide

The following guide shows in detail how to set up the code and run one of the experiments.

1. Clone the repository and the submodule libraries it depends on with `git clone --recurse-submodules -j8 <url> <project_name> && cd <project_name>`.
2. (Optional) Installing dependencies. This step is optional, since the code can also be run with Docker without installing dependencies, but highly recommended if you plan on changing anything.
    - Install Poetry as described [here](https://python-poetry.org/docs/master/#installing-with-the-official-installer).
    - Change the cache directory. By default, Poetry stores virtual environments inside the home directory, whose storage space is shared with other users and can quickly fill up and cause problems. Instead, run `poetry config cache-dir /NS/venvs/nobackup/<your username>/pypoetry_cache` to point it to a better cache directory. Note that you might have to create this directory on the NFS first.
    - Then, create a virtual environment and install the project's dependencies by running `poetry install --with dev` inside the main project directory.
    - If you're committing changes, please install the pre-commit hooks first by running `poetry run pre-commit install`. The hooks will run before every commit to ensure a uniform code-style.
3. Configuration: CAVEAT: don't commit your changes to the files below, else they will trigger merge conflicts and/or overwrite other people's configurations. I'll try to find a better solution for this.
    - Update the values in `conf/config.yaml`. In particular, change the `username`, `results_root_dir` and `results_url_prefix` values.
    - If you want to use the sync script to sync data between your local machine and the NFS on the server, edit the `scripts/sync.sh` file. Change the `SERVER_DIR` and `WORKSPACE_DIR` values to point to the paths where you would like to store the project data on the server and locally, respectively.
4. Sync your code to the server (if you're running locally):
    - Once: If you haven't done so, I would recommend adding your public SSH key to the server. You can do so using `ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<redacted>`.
    - Once: SSH into the contact server (`ssh <username>@<redacted>`) and create the project directory on the server, e.g. under `/NS/llm-1/<username>/llms`
    - Once: For each item in `[src/, conf/, scripts/, pyproject.toml, poetry.py]`, run `./scripts/sync.sh to-svr <item>`, e.g. `./scripts/sync.sh to-svr src/`. Important: always add the `/` at the end of folder names like `src/`, else the folder will be nested in some undesirable way. I know the sync script needs some love, I'll try to look into that at some point.
    - If you make changes locally that you want to sync to the server, e.g. in `src`, you can push them using `./scripts/sync.sh to-svr src/`.
    - If you want to download data from the server to your local machine, for example artifacts generated by an experiment, you can do so using `./scripts/sync.sh from-svr artifacts/`.
5. a) Run the Memorization-Hyperparameter Relationship (MHR) experiment on the server:
    - <redacted>
    - SSH into one of the compute servers, e.g. `ssh <username>@<redacted>` and navigate to your project directory on the NFS.
    - Once: You might want to configure [Weights & Biases](https://wandb.ai/site). Once you've done that and have an API key, you should put it into a file under `$HOME/.config/dot_envs/wandb.env` with the content `WANDB_API_KEY=<key>` in your NFS home directory. The definitions in that file will be accessible as environment variables inside the Docker containers.
    - Run the test configuration of the MHR experiment with Docker using `./scripts/run_docker.py --gpus=0 deepspeed src/main.py +sid=0 +mhr=test`. Ideally the code should run through and you should see some dummy training statistics at the end. At this point "the code works :)".
    - Results like evaluation statistics and model checkpoints are stored inside the `artifacts` directory (though this particular experiment does not save model checkpoints, since they're relatively quick to recompute and take up a lot of space).
5. b) Run the MHR experiment on SLURM:
    - An alternative to running on the interactive SSH machines is to run on the SLURM [Compute Grid](<redacted>), which usually has more free capacity.
    - To run jobs there, SSH into `<redacted>` by running `ssh <username>@<redacted>`.
    - Run `squeue` to see running jobs.
    - Run `sbatch scripts/slurm/single_a40.sh deepspeed src/main.py +sid=0 +mhr=test` to start a job. Configuring and starting the node can take a while; run `squeue` to monitor the progress.
    - The output written by the jobs is stored inside the `logs` directory.
6. Evaluating and sharing results
    - If you want to evaluate results locally, you should download them first using `./scripts/sync.sh from-svr artifacts/`. But it's also possible to run the evaluation on the server.
    - Start a notebook server using `./scripts/run_docker.py -i python src/main.py`, either on the server or locally. You might have to set up Docker first on your local machine.
    - If you are running the notebook server on the cluster, you need to enable port-forwarding to be able to access it from your browser. For that you can simply run `./scripts/connect_server_notebook.sh`.
    - Then you can just run notebooks normally. For example, for the MHR experiment, you can open and run the notebook under `src/experiments/memorization_hyperparam_rel/notebooks/data_params.ipynb` which analyzes the effect of training hyperparameters on memorization.
    - NOTE: the first cell in that notebook does some necessary setup and should be run exactly once (it changes the working directory to make imports work properly, so if you rerun it, you have to restart the kernel). You should also copy the code in that cell to other notebooks if you create new ones.
    - The last cell in the notebook exports it as HTML (without the code, just the markdown and cell outputs) and uploads it to your personal webspace. If you get file not found errors you probably have to first create the respective directories under `public_html`. The code prints a link that you can share with others and embed on the Wiki.
    - NOTE 1: You can use [Plotly](https://plotly.com/python/) to generate interactive plots in the notebook. Those plots will be exported to HTML along with the notebook. Usually the results are preferable to Matplotlib plots.
    - NOTE 2: The export and upload code only uploads the last saved version of the notebook, so you should save it (CTRL + S) before running the upload.
    - NOTE 3: The HTML files are uploaded using SSH. I usually run the code from my local machine and have forwarded the SSH agent into the Docker container accordingly. If you run the notebook server on the cluster, you might have to make your SSH key available there as well.
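
The trailing-slash rule from step 4 is easy to forget, so it can help to generate the sync commands instead of typing them. The sketch below is an illustration only: `sync_command` is a hypothetical wrapper around `./scripts/sync.sh`, and it assumes that any name without a file extension is a directory, which may not hold for every item.

```python
from pathlib import PurePosixPath
from typing import List


def sync_command(item: str, direction: str = "to-svr") -> List[str]:
    """Build one ./scripts/sync.sh call, adding the trailing "/" for directories."""
    if direction not in ("to-svr", "from-svr"):
        raise ValueError(f"unknown sync direction: {direction!r}")

    # Heuristic (an assumption of this sketch): names that already end in "/"
    # or that have no file extension are treated as directories.
    is_directory = item.endswith("/") or PurePosixPath(item).suffix == ""
    if is_directory and not item.endswith("/"):
        item += "/"
    return ["./scripts/sync.sh", direction, item]


for name in ["src", "conf/", "scripts", "pyproject.toml"]:
    print(" ".join(sync_command(name)))
```

Each returned list could be executed with `subprocess.run(command, check=True)` from the project root; pass `direction="from-svr"` to download, e.g. for `artifacts/`.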


## Design Rationales

### Configuration

The project uses [Hydra](https://hydra.cc/) for configuration management.
For a quick intro to Hydra, see [this guide](https://hydra.cc/docs/intro/).

For specifying configuration throughout the code, e.g. the configuration for a training task, the preferred way is using [Python dataclasses](https://docs.python.org/3/library/dataclasses.html).
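
As a rough illustration of the dataclass style, a nested configuration could look like the sketch below. All class names, field names and default values here are hypothetical and are not taken from the project's actual configuration classes.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TrainingConfig:
    # Hypothetical training options with placeholder defaults.
    learning_rate: float = 1e-4
    batch_size: int = 8
    num_epochs: int = 1


@dataclass
class ExperimentConfig:
    # Hypothetical top-level configuration for one experiment run.
    name: str
    seed_id: int = 0
    # default_factory gives every instance its own TrainingConfig object.
    training: TrainingConfig = field(default_factory=TrainingConfig)


config = ExperimentConfig(name="test")
print(asdict(config))
```

With a nested layout like this, a command-line override of the form `++<flag>.<subflag>=<value>` would correspond to a path such as `training.batch_size` in the sketch.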

### Experiment structure

All experiments are located under `src/experiments/`.
There is a folder for each experiment containing:

- `experiment.py`: the actual code to run the experiment.
- `config.py`: different configurations that run the experiment in different ways.
- `data.py`: possibly code to generate or load experiment-specific data.
- `results.py`: utilities for loading, visualizing and sharing results.
- notebooks for specific types of analysis.

Importantly, the `config.py` file of each experiment is supposed to define a handle object that is used by the `main.py` file to register and dispatch experiments.
For example, the Memorization Hyperparameter Relationship (MHR) experiment defines an `MHRHandle` object inside its `config.py` file which is imported by `main.py` and added into the `experiment_handles` array, such that it can be invoked from the command line.
The handle objects should define an `id` field, e.g. `mhr`, which is then used to determine which experiment to run.
E.g. if you run `./scripts/run_docker.py --gpus=0 deepspeed src/main.py +sid=0 +mhr=test`, the code will dispatch the MHR experiment with the test configuration.

If you need to create a new experiment, the easiest way is to copy the folder of the existing experiment that matches the new one most closely and modify it according to your needs.
In particular, update the `id` field in the experiment handle in the `config.py` file and register the new experiment in `main.py`.
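
The register-and-dispatch pattern described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: the `ExperimentHandle` class, its `configs` and `run` fields, and the `dispatch` function are hypothetical, and the real handle objects in `config.py` and `main.py` will differ in their details.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExperimentHandle:
    # Shortcode used on the command line, e.g. "+mhr=test".
    id: str
    # Named configurations for this experiment (hypothetical structure).
    configs: Dict[str, dict]
    # Function that runs the experiment for one configuration.
    run: Callable[[dict], str]


def run_mhr(config: dict) -> str:
    # Placeholder for the actual experiment code.
    return f"ran MHR with {config}"


MHRHandle = ExperimentHandle(id="mhr", configs={"test": {"num_epochs": 1}}, run=run_mhr)

# New experiments are registered by appending their handle to this list.
experiment_handles: List[ExperimentHandle] = [MHRHandle]


def dispatch(experiment_id: str, config_name: str) -> str:
    """Find the handle with the given id and run the named configuration."""
    for handle in experiment_handles:
        if handle.id == experiment_id:
            if config_name not in handle.configs:
                raise KeyError(f"unknown configuration {config_name!r} for {experiment_id!r}")
            return handle.run(handle.configs[config_name])
    raise KeyError(f"no experiment registered under id {experiment_id!r}")


print(dispatch("mhr", "test"))
# prints: ran MHR with {'num_epochs': 1}
```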

### Running experiments vs analyzing results

The experiment setup decouples experiment execution from result analysis.
The experiments are run from the command line (e.g. `./scripts/run_docker.py deepspeed src/main.py +sid=0 +mhr=pyt-1b_alph-latin-4` for the 4 character experiment), and they store results (mostly Pandas DataFrames with probabilities or loss values in this case) inside files in the `artifacts` directory (e.g. for the 4 character experiment in `artifacts/memorization_hyperparam_rel/pyt-1b_alph-latin-4/sid_0/result.pkl`).
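
Judging from the example path above, result files appear to follow the layout `artifacts/<experiment folder>/<config name>/sid_<seed id>/result.pkl`. The helper below is a small sketch built on that assumption; `result_path` is a hypothetical function, and other experiments may store their files differently.

```python
from pathlib import Path


def result_path(
    experiment_dir: str,
    config_name: str,
    seed_id: int = 0,
    artifacts_root: str = "artifacts",
) -> Path:
    """Build the assumed location of the result file for one run."""
    return Path(artifacts_root) / experiment_dir / config_name / f"sid_{seed_id}" / "result.pkl"


path = result_path("memorization_hyperparam_rel", "pyt-1b_alph-latin-4")
print(path.as_posix())
# prints: artifacts/memorization_hyperparam_rel/pyt-1b_alph-latin-4/sid_0/result.pkl
```

If the file holds a pickled DataFrame, it could then be loaded with `pandas.read_pickle(path)`.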

After an experiment is done running, you can analyze its results, preferably from a Jupyter notebook (you can start one using `./scripts/run_docker.py -i python src/main.py`).
For example, code for analyzing the experiment above can be found in `src/experiments/memorization_hyperparam_rel/notebooks/data_params.ipynb`.
The `res_util.show_constrained_results(...)` code reads the result files generated by running the experiment and plots them, but doesn't rerun anything.
More precisely, it loads the results for all configurations in the `cfg.ALPHABET_ARGS` config dict (those are the configurations that vary the alphabet size) and then plots those corresponding to the 1B models (the `constraints={"model_id": "pyt-1b"}` part).
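
To make the role of the `constraints` argument concrete, the sketch below filters a list of result records by exact-match constraints. It is not the implementation of `res_util.show_constrained_results`; the function `filter_by_constraints`, the record layout and the second record are made up for illustration.

```python
from typing import Any, Dict, List


def filter_by_constraints(
    results: List[Dict[str, Any]], constraints: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Keep only the records whose fields match every constraint exactly."""
    return [
        record
        for record in results
        if all(record.get(key) == value for key, value in constraints.items())
    ]


# Placeholder metadata records; the second entry is hypothetical.
results = [
    {"config": "pyt-1b_alph-latin-4", "model_id": "pyt-1b"},
    {"config": "other-model_alph-latin-4", "model_id": "other-model"},
]

selected = filter_by_constraints(results, constraints={"model_id": "pyt-1b"})
print([record["config"] for record in selected])
# prints: ['pyt-1b_alph-latin-4']
```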


## Troubleshooting

### HDF5 Error when installing (py)tables
- Make sure you have HDF5 installed, e.g. on macOS with Homebrew by running `brew install hdf5`.
- Make sure Poetry can find the library paths by running:
    - `export HDF5_DIR=/opt/homebrew/opt/hdf5; export BLOSC_DIR=/opt/homebrew/opt/c-blosc`
    - See also <https://stackoverflow.com/questions/73029883/could-not-find-hdf5-installation-for-pytables-on-m1-mac>.
- Then rerun `poetry install`.
