# repo: Desynced Low Communication Adaptive Optimizers for Training Foundation Models

This README provides instructions on how to set up and run the repo.

## Table of Contents

- [repo: Desynced Low Communication Adaptive Optimizers for Training Foundation Models](#repo-desynced-low-communication-adaptive-optimizers-for-training-foundation-models)
  - [Table of Contents](#table-of-contents)
  - [System Requirements](#system-requirements)
  - [Installation](#installation)
    - [System Setup](#system-setup)
    - [Environment Installation](#environment-installation)
  - [NeurIPS Guidelines for Experiment Reproducibility](#neurips-guidelines-for-experiment-reproducibility)

## System Requirements

- Ubuntu 22.04 (or compatible Linux distribution)
- NVIDIA GPU with CUDA 12.4 support
- Python 3.11.9
- At least 16GB of GPU memory for 125M parameter models (more for larger models)

## Installation

### System Setup

The `system_setup.sh` script sets up the system environment for repo, including CUDA, cuDNN, and Python:

```bash
cd scripts
. system_setup.sh
```

> Note that you may want to inspect the script before executing it and remove any installation steps your system does not need. For example, if CUDA is already installed, remove the CUDA installation part of the script.
>
> Note that `system_setup.sh` defines environment variables that the subsequent installation scripts use, so source it with `.` if you run those scripts in the same bash session. Alternatively, execute it and then source your `.bashrc` file.
>
> Note that the script also includes logic for storing AWS S3 and Weights & Biases (wandb) credentials, although the framework does not strictly need them to run. To log to wandb or use AWS S3, set up those credentials and modify the configuration parameters in the example scripts accordingly.

This script:

- Installs essential system packages
- Installs CUDA 12.4 and sets up environment variables
- Sets GPU persistence mode
- Installs cuDNN
- Sets up uv and installs Python 3.12
- Installs necessary monitoring utilities
- [Optional] Sets up AWS S3 credentials for dataset streaming (you'll need to add your credentials)
- [Optional] Sets up Weights & Biases (wandb) for experiment tracking (you'll need to add your credentials)

### Environment Installation

After system setup, install the repo environment:

```bash
. install_env.sh --project_path /path/to/repo
```

Parameters:

- `--project_path` or `-p`: Path to the repo project directory (default: `$HOME/projects/repo`)

> Note that the script installs the `flash-attention` library manually, because uv has difficulty resolving its dependencies. This requires compiling the library from source, which might take a while depending on the system.
>
> Note that some NVIDIA GPUs don't support the `flash-attention` library. In this case, remove the `flash-attention` dependency from the `pyproject.toml` file and run the environment installation script again. You will also need to adjust the Hydra configuration, replacing `llm_config.model.attn_config.attn_impl=flash` with `llm_config.model.attn_config.attn_impl=torch`.
>
> Note that, similarly, some NVIDIA GPUs don't support mixed-precision execution with bf16. In this case, we recommend switching to fp16 mixed precision or to fp32 by replacing `llm_config.precision=amp_bf16` with `llm_config.precision=amp_fp16` or `llm_config.precision=fp32`, respectively.

This script:

- Sets up a uv environment with all required dependencies
- Installs flash-attention for GPU acceleration
- Sets up necessary environment variables
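
The two GPU-compatibility notes above can be sketched as a small helper that picks the Hydra overrides from a GPU's compute capability and bf16 support. This is only an illustration: the function name `fallback_overrides` is hypothetical and not part of the repo, and the compute-capability threshold of 8.0 (Ampere) for `flash-attention` is an assumption that depends on the installed `flash-attention` version.

```python
def fallback_overrides(compute_capability, bf16_supported):
    """Return Hydra overrides for GPUs lacking flash-attention or bf16 support.

    compute_capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    bf16_supported: whether the GPU can run bf16 mixed precision.
    """
    overrides = []
    major, _minor = compute_capability
    # Assumption: flash-attention needs compute capability 8.0 or newer.
    if major < 8:
        overrides.append("llm_config.model.attn_config.attn_impl=torch")
    # Fall back to fp16 mixed precision; fp32 is the other option in the notes.
    if not bf16_supported:
        overrides.append("llm_config.precision=amp_fp16")
    return overrides


# Example: a hypothetical GPU with compute capability 7.5 and no bf16 support.
for override in fallback_overrides((7, 5), bf16_supported=False):
    print(override)
```

With PyTorch installed, `torch.cuda.get_device_capability()` and `torch.cuda.is_bf16_supported()` can supply the two arguments.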

## NeurIPS Guidelines for Experiment Reproducibility

To adhere to anonymization requirements, all relevant repository forks have been provided within the accompanying zip file under the `forks` directory, with the `.git` folders removed. You must manually add these repositories to the `pyproject.toml` file.

This repository uses a flexible configuration system. The primary settings for federated and communication-efficient training are defined in `repo/conf/base.yaml`, which is validated with pydantic against a typed schema (`repo/conf/base_schema.py`). The model architectures are controlled via Composer-compatible files (e.g., `repo/conf/llm_config/smollm-135m.yaml` for the 135M model and `repo/conf/llm_config/smollm-1B.yaml` for the 1.7B model), which contain standard Composer parameters, seeds for model initialization, and data sampling settings.

For experiments detailed in our paper, the critical configuration parameters within `base.yaml` are:

- `fl.n_local_steps`: Defines the base synchronization period.
- `fl.parameter_scheduler_kwargs`: Sets the synchronization periods for the parameters (`PARAMETERS`), the first moment (`EXP_AVG`), and the second moment (`EXP_AVG_SQ`), each expressed as a multiple of the base period.
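
To make the relationship between the two parameters concrete, the sketch below lists which optimizer states would be due for synchronization at a given local step. It assumes a state is synchronized whenever the step count is a multiple of `fl.n_local_steps` times that state's multiplier; the function `states_to_sync` and the example values are hypothetical and do not reproduce the repo's scheduler.

```python
STATES = ("PARAMETERS", "EXP_AVG", "EXP_AVG_SQ")


def states_to_sync(local_step, n_local_steps, multipliers):
    """Return the states whose synchronization period divides local_step."""
    due = []
    for state in STATES:
        # Period of a state = base period * its multiplier.
        period = n_local_steps * multipliers[state]
        if local_step > 0 and local_step % period == 0:
            due.append(state)
    return due


# Hypothetical values: base period 16, moments synchronized less often.
multipliers = {"PARAMETERS": 1, "EXP_AVG": 2, "EXP_AVG_SQ": 4}
for step in (16, 32, 64):
    print(step, states_to_sync(step, 16, multipliers))
```

Under these assumed values, the parameters are due every 16 steps, the first moment every 32, and the second moment every 64.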

The default values run Local Adam with `K=fl.n_local_steps`. To configure custom synchronization periods (`K_x`, `K_u`, `K_v`) for the parameters, first moment, and second moment, find their greatest common divisor (GCD), set `fl.n_local_steps` to this GCD, then set each state's multiplier accordingly:


- `fl.parameter_scheduler_kwargs.PARAMETERS = K_x / GCD`
- `fl.parameter_scheduler_kwargs.EXP_AVG = K_u / GCD`
- `fl.parameter_scheduler_kwargs.EXP_AVG_SQ = K_v / GCD`
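
The conversion above can be sketched in a few lines of Python. The helper `sync_config` is hypothetical (it is not part of the repo), and the periods in the example are arbitrary illustrative values; only the configuration keys come from `base.yaml` as described above.

```python
from math import gcd


def sync_config(k_x, k_u, k_v):
    """Map desired periods (K_x, K_u, K_v) to the base period and multipliers."""
    base = gcd(k_x, gcd(k_u, k_v))
    return {
        "fl.n_local_steps": base,
        "fl.parameter_scheduler_kwargs.PARAMETERS": k_x // base,
        "fl.parameter_scheduler_kwargs.EXP_AVG": k_u // base,
        "fl.parameter_scheduler_kwargs.EXP_AVG_SQ": k_v // base,
    }


# Illustrative periods: K_x=64, K_u=128, K_v=384 share a GCD of 64.
for key, value in sync_config(64, 128, 384).items():
    print(f"{key}={value}")
```

When all three periods are equal, every multiplier is 1, which matches the default Local Adam behaviour described above.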


The dataset streaming configuration relies on an S3-compatible storage service (we use MinIO). The script `scripts/convert_hf_dataset_to_mds_smollm_corpus.sh` demonstrates how to download, convert, and upload datasets to S3-compatible storage. Alternatively, you may adapt this script for local use, but we cannot guarantee the robustness of that setup.

Data distribution across workers is defined in `repo/conf/dataset/streams`. Each list element under `client_streams:` corresponds directly to a worker. For example:

* IID setup: `repo/conf/dataset/streams/smollm_corpus_4_clients_iid.yaml`
* Non-IID setup: `repo/conf/dataset/streams/smollm_corpus_4_clients_non_iid.yaml`

These configurations reference the shared constants and sampling algorithms defined in `repo/conf/dataset/smollm-corpus-shared.yaml` ([Mosaic sampling details](https://docs.mosaicml.com/projects/streaming/en/latest/dataset_configuration/replication_and_sampling.html)), which also provides the seed for data sampling.

Example experiment runs are provided via scripts under `scripts/neurips/`:

* IID experiments (135M model): `scripts/neurips/sweep_iid.sh` (which calls `run_sweep_opt_iid_135M.sh`).
* To modify model size, replace `SMOLLM_135M` with `SMOLLM_1B`, adjust the model configuration file accordingly, and set parameters as needed.
* To change data distribution, update `smollm_corpus_4_clients_iid` to the desired stream file.
* Standard DDP experiments: change the script from `repo_base_independent.sh` to `base_centralised_training.sh` and set `global_train_batch_size` to 1024.

These instructions ensure accurate replication of our experiments, provided configurations and adjustments are applied carefully.
