# Spaced Scheduled Training (SST)

This repository contains the official implementation of the Spaced Scheduled Training (SST) algorithm, as described in the paper:

Spaced Scheduling for Large Language Model Training, by [Authors]

![Method Diagram](assets/method_diagram.png)

## Overview

SST is a novel, adaptive, and efficient data selection strategy for large language models that dynamically adjusts
the training dataset based on a model's evolving learning state. Unlike approaches that rely on external scoring models,
SST prioritizes training examples based on per-example perplexity, a computationally efficient and reliable proxy for example difficulty.

## Key Features

- **No External Reference Models**: Eliminates the need for costly external oracle models for data selection
- **Model-Tailored Selection**: Adapts selection to the target model's unique characteristics (size, pre-training data composition)
- **Dynamic Adaptation**: Continuously adjusts the dataset mix throughout training via a "spaced scheduling" mechanism
- **Consistent Performance**: Delivers strong results across model architectures and sizes (0.5B to 32B parameters)
- **Low Computational Overhead**: Efficiently implementable for large-scale training

## Performance Highlights

- Outperforms both random baselines and ChatGPT-based methods (InsTag, DEITA) across multiple model families
- Evaluated on eight LLMs ranging from 0.5B to 32B parameters from four distinct model families:
  - Llama 3.1
  - Llama 3.2
  - Gemma 2
  - Qwen 2.5
- Achieves higher Open LLM Leaderboard scores with less variance compared to baseline methods

## Implementation

SST works by:
1. Computing per-example perplexity across the training dataset
2. Using this metric to dynamically prioritize examples throughout training
3. Implementing a "spaced scheduling" mechanism that allows different models to emphasize data most beneficial at each training stage
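
As a rough illustration of step 1, the sketch below computes per-example perplexity from token-level log-probabilities, using the standard definition (the exponential of the mean negative log-likelihood). The function names `example_perplexity` and `rank_by_perplexity` are hypothetical and are not part of this repository. The highest-first ordering is only an assumption for demonstration; this README does not specify how SST maps perplexity to a selection.

```python
import math
from typing import List, Sequence


def example_perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of one example: exp of the mean negative log-likelihood.

    `token_log_probs` holds the natural-log probability the model assigned
    to each target token of the example.
    """
    if not token_log_probs:
        raise ValueError("an example needs at least one scored token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def rank_by_perplexity(per_example_log_probs: Sequence[Sequence[float]]) -> List[int]:
    """Return example indices ordered from highest to lowest perplexity.

    Assumption: the ordering direction is illustrative only.
    """
    ppl = [example_perplexity(lp) for lp in per_example_log_probs]
    return sorted(range(len(ppl)), key=lambda i: ppl[i], reverse=True)
```

In practice the log-probabilities would come from a forward pass of the model being trained, which is why no external scoring model is needed.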

## Usage

Step 1: Install the required dependencies

```sh
pip install -r requirements.txt
```

Step 2: Integrate the SST trainer into your training pipeline.

We provide a slightly modified Hugging Face Trainer class that supports SST. You can use this class to train your model with SST.
In our work we used the training pipeline of [Open Instruct](https://github.com/allenai/open-instruct/blob/cf07adb0ed2055d98ff69033795587dac3416b70/open_instruct/finetune_trainer.py).

The SST trainer requires two components:
1. A weighted sampler that removes or adds examples to the training dataset by setting their sampling probability to 0 or `1/num_examples`.
2. An SST state callback that updates the selection window as training proceeds.

```python
from src.sampler.weighted_sampler import CustomWeightedRandomSampler
from src.trainer import SstTrainer, SstStateCallback

# Create a factory method to initialize the sampler
def create_sampler(**kwargs):
    """kwargs are the arguments to pass to torch WeightedRandomSampler"""
    return CustomWeightedRandomSampler(**kwargs)

# Create a factory method to initialize the SST state callback
def create_sst_state_callback(**kwargs):
    """See the SstStateCallback.Config class for the available arguments"""
    return SstStateCallback(**kwargs)

# NOTE: Using HF Transformers callbacks lets SST plug into an existing HF Transformers training pipeline.
# It also ensures that any data transfer across GPUs occurs during gradient sync, which avoids additional communication overhead.


# Initialize the trainer with your model and dataset
trainer = SstTrainer(
    # Existing HF Transformers Trainer arguments,
    sampler_factory=create_sampler,
    sst_state_callback_factory=create_sst_state_callback,
)

# NOTE: In a multi-GPU setup, the factory methods above ensure that the accelerate library
# correctly manages the sampler (using accelerate.prepare).
# The sst_state_callback_factory ensures that any tensors shared across GPUs are handled correctly,
# as opposed to initializing the callback outside the trainer and then finding the GPU ids when creating the tensors.

# Train with SST
trainer.train()
```
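
To make the sampler's role concrete, here is a minimal, dependency-free sketch of the weighting scheme described above: each example gets weight 0 (excluded) or `1/num_examples` (included). The helpers `selection_weights` and `draw_indices` are hypothetical and do not reflect the internals of `CustomWeightedRandomSampler`. The sketch also assumes that `num_examples` is the full dataset size, and it uses Python's `random.choices` in place of a PyTorch sampler.

```python
import random
from typing import List, Sequence


def selection_weights(active: Sequence[bool]) -> List[float]:
    """Map an inclusion mask to sampling weights of 0 or 1/num_examples.

    Assumption: num_examples is the total dataset size, not the number of
    currently active examples.
    """
    num_examples = len(active)
    if num_examples == 0:
        raise ValueError("the dataset must contain at least one example")
    return [1.0 / num_examples if is_active else 0.0 for is_active in active]


def draw_indices(weights: Sequence[float], num_samples: int, seed: int = 0) -> List[int]:
    """Draw dataset indices with replacement, proportionally to `weights`."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


# Examples 1 and 3 are currently excluded from training.
weights = selection_weights([True, False, True, False])
batch_indices = draw_indices(weights, num_samples=8)
```

The weights do not need to sum to one. `torch.utils.data.WeightedRandomSampler` likewise accepts unnormalized weights, so removing an example only requires setting its entry to 0.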


## Data

All the data used in the paper is available in our public Hugging Face repository: [sst-data](https://huggingface.com/TBD)
<!-- TODO: Update the data set link after double blind review -->

We also share the script we used to create the InsTag and DEITA datasets in the `data` folder.



## Models

We release the trained models used in the paper in our public Hugging Face repository: [sst-models](https://huggingface.com/TBD)
<!-- TODO: Update the model HF links after double blind review -->

## Citation

If you use SST in your research, please cite our paper:

```bibtex
@article{}
```
<!-- TODO: Update the citation set link after acceptance -->

## Contributing

We welcome contributions to improve SST! Please feel free to submit issues and pull requests.
