# Exploring Memorization in LLMs

## Step-by-step guide

The following guide shows how to set up the code and run one of the experiments.

### Setup

1. Clone the repository and the submodule libraries it depends on with `git clone --recurse-submodules -j8 <url> <project_name> && cd <project_name>`.
    - Use SSH-based authentication to pull the project; otherwise there will be errors with the submodules. Here are guides for [generating](https://docs.gitlab.com/ee/user/ssh.html#generate-an-ssh-key-pair) and [adding](https://docs.gitlab.com/ee/user/ssh.html#add-an-ssh-key-to-your-gitlab-account) an SSH key to GitLab.
2. (Optional) Installing dependencies. This step is optional, since the code can also be run with Docker without installing dependencies. It's necessary if you want to develop.
    - Install Poetry as described [here](https://python-poetry.org/docs/master/#installing-with-the-official-installer).
    - Change the cache directory. By default, Poetry stores virtual environments inside your home directory, whose storage space is shared with other users and can quickly fill up. Instead, run `poetry config cache-dir /ANONYMOUS/venvs/nobackup/<your username>/pypoetry_cache` to point Poetry to a more suitable cache directory. Note that you might have to create this directory on the NFS first.
    - Then, create a virtual environment and install the project's dependencies by running `poetry install --with dev` inside the main project directory.
    - If you're committing changes, please install the pre-commit hooks first by running `poetry run pre-commit install`. The hooks will run before every commit to ensure a uniform code style.
3. Configuration: All project configurations are stored inside the `project_config.yaml` file. Run `./actions.py init` to generate it. You should only need to specify your ANONYMOUS username; the rest of the values will be inferred automatically. You can adjust some of them during config file generation, or afterwards by editing the config file.

### Running and evaluating experiments

4. Sync your code to the server, if you're setting up the project locally. Otherwise you can skip this step.
    - Once: If you haven't done so, I would recommend adding your public SSH key to the server. You can do so using `ssh-copy-id -i ~/.ssh/id_rsa.pub <ANONYMOUS-username>@contact.ANONYMOUS-ANONYMOUS.org`.
    - Once: SSH into the contact server (`ssh <ANONYMOUS-username>@contact.ANONYMOUS-ANONYMOUS.org`) and create the project directory on the server. Check the `server_project_root` value in the `project_config.yaml` file for the location that the script expects by default.
    - Sync (local -> server) the required project files by running `./actions.py sync to pyproject.toml poetry.lock actions.py src`. Later you can sync just the parts that changed, e.g. run `./actions.py sync to src` to only sync the code.
    - Also run step 3 on the server, to generate the `project_config.yaml` config file.
    - If you want to sync in the reverse direction (server -> local), for example to download artifacts generated by an experiment, you can do so using `./actions.py sync from artifacts`.
5. a) Run the Memorization-Dynamics Relationship (MD) experiment on the server:
    - Once: If you haven't set up a VPN connection to the ANONYMOUS network yet and are outside the institute core network, I would recommend doing so now by following the documentation [here](https://plex.ANONYMOUS-ANONYMOUS.ANONYMOUS/display/Documentation/Access+From+Outside%3A+ssh+and+vpn), and then starting the VPN.
    - SSH into one of the compute servers, e.g. `ssh <ANONYMOUS-username>@ANONYMOUS-2a40-05` and navigate to your project directory on the NFS.
    - Once: You might want to configure [Weights & Biases](https://wandb.ai/site). If you want to use Docker, once you've done that and have an API key, put the key into the file `$HOME/.config/dot_envs/wandb.env` in your NFS home directory, with the content `WANDB_API_KEY=<key>`. The definitions in that file will be accessible as environment variables inside the Docker containers.
    - Run the test configuration of the MD experiment:
        - With Docker: `./actions.py run-docker --gpus=0 torchrun src/main.py +md=test`
        - Directly on the machine: `poetry run torchrun src/main.py +md=test`
    - Ideally the code should run through and you should see some dummy training statistics at the end. At this point "the code works :)".
    - Results like evaluation statistics and model checkpoints are stored inside the `artifacts` directory (though this particular experiment does not save model checkpoints, since they're relatively quick to recompute and take up a lot of space).
5. b) Run the MD experiment on SLURM:
    - An alternative to running on the interactive SSH machines is to run on the SLURM [Compute Grid](https://wiki.ANONYMOUS-ANONYMOUS.org/wiki/ComputeGrid), which usually has more free capacity.
    - To run jobs there, SSH into `ANONYMOUS-grid-submit` by running `ssh <username>@ANONYMOUS-grid-submit`.
    - Run `./actions.py run-slurm -c 1a40 -t 1h src/main.py +md=test` to start a job. Configuring and starting the node can take a while.
    - Run `squeue` to see running jobs and monitor the progress.
    - The output written by the jobs is stored inside the `logs` directory.
6. Evaluating and sharing results
    - If you want to evaluate results locally, you should download them first using `./actions.py sync from artifacts`. Or just run everything on the server.
    - Start a notebook server using `./actions.py run-docker python src/main.py` (with Docker) or `poetry run python src/main.py` (without Docker), either on the server or locally. You might have to set up Docker on your local machine first.
    - If you are running the notebook server on the cluster, you need to enable port-forwarding to be able to access it from your browser. For that you can simply run `./actions.py connect <server-with-notebook>`.
    - Then you can just run notebooks normally. For example, for the MD experiment, you can open and run the notebook under `src/experiments/memorization_dynamics/notebooks/model_type.ipynb`.
    - NOTE: the first cell in that notebook does some necessary setup and should be run exactly once (it changes the working directory to make imports work properly, so if you rerun it, you have to restart the kernel). You should also copy the code in that cell to other notebooks if you create new ones.
    - The last cell in the notebook exports it as HTML (without the code, just the markdown and cell outputs) and uploads it to your personal ANONYMOUS webspace. If you get file not found errors you probably have to first create the respective directories under `public_html`. See the `results_root_dir` value in the `project_config.yaml` file for the path that the code expects (+ `results/<experiment_name>` subdirectories). After exporting and uploading the results, the code prints a link that you can share with others and embed on the Wiki.
    - NOTE 1: You can use [Plotly](https://plotly.com/python/) to generate plots in the notebook. In contrast to Matplotlib plots, Plotly plots stay interactive when they are exported to HTML along with the notebook.
    - NOTE 2: The export and upload code only uploads the last saved version of the notebook, so you should save it (CTRL + S) before running the upload.
    - NOTE 3: The HTML files are uploaded using SSH. I usually run the code from my local machine and have forwarded the SSH agent into the Docker container accordingly. If you run the notebook server on the cluster, you might have to make your SSH key available there as well.
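
The Weights & Biases env file from step 5 a) can be created by hand or, as a minimal sketch, with a few lines of Python. The helper name `write_wandb_env` and its `home` parameter are hypothetical and not part of the project; only the file location and the `WANDB_API_KEY=<key>` content come from the guide above. Restricting the file permissions is an extra precaution that assumes a POSIX file system.

```python
import tempfile
from pathlib import Path
from typing import Optional


def write_wandb_env(api_key: str, home: Optional[Path] = None) -> Path:
    """Write WANDB_API_KEY=<key> to <home>/.config/dot_envs/wandb.env.

    Hypothetical helper: `home` defaults to the current user's home directory.
    """
    home = Path.home() if home is None else home
    env_file = home / ".config" / "dot_envs" / "wandb.env"
    # Create the dot_envs directory if it does not exist yet.
    env_file.parent.mkdir(parents=True, exist_ok=True)
    env_file.write_text(f"WANDB_API_KEY={api_key}\n")
    # The file holds a secret, so make it readable by the owner only
    # (assumption: POSIX permissions are honoured on the NFS).
    env_file.chmod(0o600)
    return env_file


if __name__ == "__main__":
    # Demonstrate against a scratch directory instead of the real home.
    with tempfile.TemporaryDirectory() as scratch_home:
        created = write_wandb_env("<key>", home=Path(scratch_home))
        print(created.read_text(), end="")
```

Calling `write_wandb_env("<key>")` without the `home` argument would write to your actual home directory.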


## Additional information on using the project

### Dependency Management

The project uses [Poetry](https://python-poetry.org/) to manage dependencies.
Poetry is more explicit than pip about how it keeps track of dependencies, and it creates virtual environments away from the actual project directory, which makes it easier to maintain projects under different environments.
To install Poetry, consult [this guide](https://python-poetry.org/docs/master/#installing-with-the-official-installer).

Create a virtual environment and install dependencies using Poetry by running `poetry install` in the root directory.
Afterwards, to run code, either first activate the virtual environment with `poetry shell` and then run commands normally, or run commands prefixed with `poetry run ...`.

Alternatively, if you only want to execute and not develop code, you can skip the setup by using Docker (assuming Docker is set up on your system).

### Docker

Using Docker is optional, but recommended.

To restrict the set of GPUs that Docker containers should use, you can append the `--gpus=id1,id2` option to `./actions.py run-docker`.

The `src` directory is mounted, not copied, inside the container.
This way, all changes made on the host and from inside the container are reflected on the other.
You can edit mount options of the docker container in the `project_config.yaml` file under `docker`.

### Starting a notebook server

The best way to start a notebook server is to run `python src/main.py`, either inside a Docker container or after activating the Poetry virtual environment.

### Running experiments

To run a specific experiment, execute `[./actions.py run-docker] torchrun src/main.py +<experiment_name>=<config_name> [++sid=<seed_id>]`, where `<experiment_name>` is a shortcode specifying which experiment you want to run, and `<config_name>` is the name of one of the configurations for the experiment.
`torchrun` is used to enable training (large) LLMs via `deepspeed`.
If you don't need multi-GPU support, you can probably just replace `torchrun` with `python`.

You can override experiment configuration options on the command line with `++`, i.e. by appending `++<flag>.<subflag>=<value>`.
You can also execute multiple configurations sequentially by adding the `--multirun` or `-m` flag.
For example, to run configurations `test1` and `test2` for an experiment and overwrite the `training.epochs` parameter, you can use `./actions.py run-docker torchrun src/main.py -m +exp_id=test1,test2 ++exp_id.training.epochs=100`.
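
To make the shape of these command lines concrete, here is a minimal sketch that assembles them programmatically. The function `build_run_command` is hypothetical and not part of the project; it only mirrors the pattern described above, including the assumption that override keys are nested under the experiment id, as in the `exp_id.training.epochs` example.

```python
from typing import Dict, List, Optional, Sequence


def build_run_command(
    experiment_id: str,
    config_names: Sequence[str],
    overrides: Optional[Dict[str, object]] = None,
    seed_id: Optional[int] = None,
    use_docker: bool = True,
) -> List[str]:
    """Assemble the argument list for one experiment invocation (hypothetical helper)."""
    cmd: List[str] = ["./actions.py", "run-docker"] if use_docker else []
    cmd += ["torchrun", "src/main.py"]
    if len(config_names) > 1:
        # Running several configurations sequentially needs the multirun flag.
        cmd.append("-m")
    # "+" adds the experiment selection to the configuration.
    cmd.append(f"+{experiment_id}={','.join(config_names)}")
    if seed_id is not None:
        cmd.append(f"++sid={seed_id}")
    for key, value in (overrides or {}).items():
        # "++" overrides the key if it exists and adds it otherwise.
        cmd.append(f"++{experiment_id}.{key}={value}")
    return cmd


if __name__ == "__main__":
    command = build_run_command("exp_id", ["test1", "test2"], {"training.epochs": 100})
    print(" ".join(command))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues when override values contain special characters.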

### Accessing results and other output

The output of experiments (checkpoints, results) is (supposed to be) stored inside the `artifacts` directory.
Logs are written to the `logs` directory.


## Contributing to the project

The project uses [git pre-commit hooks](https://pre-commit.com/) to maintain code style and integrity, i.e. for code formatting, linting and unit-testing.
Before making any commits, please install `pre-commit` as described below.
The unit tests run inside a Docker container, so make sure you have Docker installed as well (see the Docker section above).

When using Poetry, install the library dependencies (separately from the parent project) with `poetry install --with dev` and then run `poetry run pre-commit install` to set up the pre-commit hooks.
The process for pip is analogous (TODO: specify the exact pip instructions once/if you use it).

Now, the pre-commit hooks should automatically run before every commit.
To manually run the pre-commit hooks, execute `poetry run pre-commit run --all-files`.


## Design Rationales

### Configuration

The project uses [Hydra](https://hydra.cc/) for configuration management.
For a quick intro to Hydra, see [this guide](https://hydra.cc/docs/intro/).

For specifying configuration throughout the code, e.g. the configuration for a training task, the preferred way is using [Python dataclasses](https://docs.python.org/3/library/dataclasses.html).
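
As an illustration of the dataclass approach, the following sketch defines a nested configuration. The class names `TrainingConfig` and `ExperimentConfig` and all of their fields are hypothetical and chosen for illustration only; the project's actual configuration classes may look different, and how they are registered with Hydra is not shown here.

```python
from dataclasses import asdict, dataclass, field, replace


@dataclass
class TrainingConfig:
    # Hypothetical fields, not taken from the project.
    epochs: int = 1
    learning_rate: float = 1e-4
    batch_size: int = 8


@dataclass
class ExperimentConfig:
    name: str
    seed_id: int = 0
    # default_factory gives every instance its own TrainingConfig object.
    training: TrainingConfig = field(default_factory=TrainingConfig)


# A small configuration for quick smoke tests.
test_config = ExperimentConfig(name="test")

# Derive a variant without mutating the original configuration.
long_config = replace(test_config, name="long", training=TrainingConfig(epochs=100))

if __name__ == "__main__":
    print(asdict(long_config))
```

Compared to plain dictionaries, dataclasses give typed fields, defaults in one place, and editor support, which is presumably why they are preferred here.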

### Experiment structure

All experiments are located under `src/experiments/`.
There is a folder for each experiment containing the actual code to run it in `experiment.py`, different configurations that run the experiment in different ways in `config.py`, utilities for loading, visualizing and sharing results in `results.py` and notebooks for specific types of analysis.

Importantly, the `config.py` file of each experiment is supposed to define a handle object that is used by the `main.py` file to register and dispatch experiments.
For example, the Memorization Dynamics (MD) experiment defines an `MDHandle` object inside its `config.py` file which is imported by `main.py` and added into the `experiment_handles` array, such that it can be invoked from the command line.
The handle objects should define an `id` field, e.g. `md`, which is then used to determine which experiment to run.
E.g. if you run `./actions.py run-docker torchrun src/main.py +md=test [++sid=0]`, the code will dispatch the MD experiment with the test configuration.
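
The registration and dispatch mechanism described above can be pictured roughly as follows. This is a simplified sketch, not the project's implementation: the `ExperimentHandle` class, its `configs` and `run` fields, and the `dispatch` function are assumptions made for illustration. Only the `id` field, the `MDHandle` name, and the `experiment_handles` collection are taken from the description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExperimentHandle:
    id: str                     # shortcode used on the command line, e.g. "md"
    configs: Dict[str, dict]    # hypothetical: config name -> configuration values
    run: Callable[[dict], str]  # hypothetical: entry point of the experiment


def run_md(config: dict) -> str:
    # Stand-in for the real experiment code in experiment.py.
    return f"ran MD for {config['epochs']} epoch(s)"


# Would live in the experiment's config.py.
MDHandle = ExperimentHandle(id="md", configs={"test": {"epochs": 1}}, run=run_md)

# Would live in main.py: every registered experiment is listed here.
experiment_handles: List[ExperimentHandle] = [MDHandle]


def dispatch(experiment_id: str, config_name: str) -> str:
    """Look up the handle by its id and run the selected configuration."""
    handles_by_id = {handle.id: handle for handle in experiment_handles}
    if experiment_id not in handles_by_id:
        raise KeyError(f"unknown experiment id: {experiment_id!r}")
    handle = handles_by_id[experiment_id]
    return handle.run(handle.configs[config_name])


if __name__ == "__main__":
    # Corresponds to `+md=test` on the command line.
    print(dispatch("md", "test"))
```

In this picture, adding a new experiment amounts to defining another handle with a fresh `id` and appending it to `experiment_handles`.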

If you need to create a new experiment, the easiest way is to copy the folder of the existing experiment that matches the new one most closely and modify it according to your needs.
In particular, update the `id` field in the experiment handle in the `config.py` file and register the new experiment in `main.py`.

### Running experiments vs analyzing results

The experiment setup decouples experiment execution from result analysis.
The experiments are run from the command line, and they store results (in this case mostly Pandas DataFrames with probabilities or loss values) inside files in the `artifacts` directory.
After an experiment has finished running, you can analyze its results, preferably from a Jupyter notebook (you can start one using `./actions.py run-docker python src/main.py`).


## Troubleshooting

### HDF5 Error when installing (py)tables
- Make sure you have HDF5 installed, e.g. with Homebrew by running `brew install hdf5`.
- Make sure Poetry can find the library paths by running:
    - `export HDF5_DIR=/opt/homebrew/opt/hdf5; export BLOSC_DIR=/opt/homebrew/opt/c-blosc`
    - For background, see https://stackoverflow.com/questions/73029883/could-not-find-hdf5-installation-for-pytables-on-m1-mac
- Then rerun `poetry install`.
