## SmolLM-DroPE Trainer

Recalibrate SmolLM-DroPE from the base [SmolLM-360M](https://huggingface.co/HuggingFaceTB/SmolLM-360M) model on FineWeb-Edu data using Hugging Face Transformers, Hydra, Accelerate, and DeepSpeed.

### Key features
- **Hydra run-configs** in `cfgs/run_cfg/` for reproducible experiments
- **DeepSpeed** ZeRO-1/ZeRO-3 configs in `accelerate_configs/`
- **Streaming datasets** from Hugging Face for large-scale training
- **Automatic gradient accumulation** computed from global and per-device batch sizes
- **W&B logging** (Weights & Biases) for experiment tracking

---

## Installation


1) Create an environment (Python 3.10+ recommended; tested with Python 3.11) and install the dependencies:
```bash
conda create -n smollm-env python=3.11 -y && conda activate smollm-env
pip install --upgrade pip
./scripts/install.sh
```

2) (Optional) Log in to Hugging Face and W&B:
```bash
huggingface-cli login    # if you need gated datasets/models
wandb login              # if you want online logging
```

---

## Quickstart
Use the launch helper `launch.sh`, which wraps `accelerate launch` and selects a DeepSpeed config.

- **Command:**
```bash
./launch.sh <num_gpus> smollm_drope/recalibration_30B.yaml zero1
```

- **Arguments:**
  - `<num_gpus>`: number of processes (e.g., 8 for 8 GPUs on one node)
  - `smollm_drope/recalibration_30B.yaml`: the Hydra run-config in `cfgs/run_cfg/`
  - `zero1`: selects the DeepSpeed ZeRO-1 Accelerate config in `accelerate_configs/`

This starts training with the defaults from the run-config and computes `gradient_accumulation_steps` to satisfy the requested global `train_batch_size`.
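The arithmetic behind that computation can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes the global batch size is the product of the per-device batch size, the number of processes, and the accumulation steps, and that a non-divisible combination is rejected. The function name `gradient_accumulation_steps` and the error behavior are hypothetical.

```python
def gradient_accumulation_steps(
    train_batch_size: int,
    per_device_train_batch_size: int,
    num_processes: int,
) -> int:
    """Return the accumulation steps needed to reach the global batch size.

    Assumed relation (not confirmed against the trainer's source):
        train_batch_size = per_device_train_batch_size * num_processes * steps
    """
    if min(train_batch_size, per_device_train_batch_size, num_processes) <= 0:
        raise ValueError("all arguments must be positive integers")

    # Samples processed across all devices in one forward/backward pass.
    samples_per_micro_step = per_device_train_batch_size * num_processes

    if train_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"per_device_train_batch_size * num_processes={samples_per_micro_step}"
        )
    return train_batch_size // samples_per_micro_step


# Example values: 8 GPUs, global batch 512, per-device batch 32.
print(gradient_accumulation_steps(512, 32, 8))  # 2
```

Under these assumptions, changing the number of GPUs while keeping `train_batch_size` fixed changes only the accumulation steps, so the effective batch size stays the same across hardware setups.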

---

## Customizing runs
You can pass Hydra overrides after the first three arguments. For example:

- **Change batch sizes and steps:**
```bash
./launch.sh 8 smollm_drope/recalibration_30B.yaml zero1 \
        train_batch_size=512 per_device_train_batch_size=32 max_steps=120000
```

---

## Troubleshooting
By default, Hydra (the configuration system) prints a shortened error message. If you encounter cryptic errors or truncated stack traces, set the environment variable `HYDRA_FULL_ERROR=1` to get the full traceback. This is especially helpful for debugging configuration or instantiation issues.

Example:
```bash
HYDRA_FULL_ERROR=1 ./launch.sh 8 smollm_drope/recalibration_30B.yaml zero1 \
        train_batch_size=512 per_device_train_batch_size=32 max_steps=120000
```
