# Sample Order Effect Estimation Framework

This repository implements a retraining-free framework for estimating how the order of training samples affects large language models (LLMs), based on the paper *"Estimating the Effects of Sample Training Orders for LLMs without Retraining."*

## Project Structure

The estimation pipeline has three stages:

### Stage 1: Pretraining
- **`pretrain.py`**  
  Trains the LLM with a fixed reference sample order and saves a set of reference checkpoints. These checkpoints are the basis for all subsequent estimation.

### Stage 2: Precomputation
- **`precomputation.py`**  
  Computes and stores the update terms (gradients and higher-order derivatives) using the reference checkpoints and all training batches. Random projection is applied to reduce memory overhead.
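
To illustrate the random projection step, here is a minimal sketch that compresses a flattened gradient with a dense Gaussian projection matrix. It is not taken from `precomputation.py`: the function names `make_projection` and `project`, the Gaussian construction, and the dimensions are all assumptions chosen for illustration, and the repository may use a different projection scheme.

```python
import math
import random


def make_projection(dim, proj_dim, seed=0):
    """Build a proj_dim x dim Gaussian matrix (hypothetical helper).

    Entries are N(0, 1) scaled by 1/sqrt(proj_dim), so squared norms are
    preserved in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(proj_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)]
            for _ in range(proj_dim)]


def project(matrix, vector):
    """Map a flattened gradient of length dim to length proj_dim."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Toy usage: a 1000-dimensional "gradient" stored as 16 numbers.
projection = make_projection(dim=1000, proj_dim=16, seed=42)
gradient = [math.sin(i) for i in range(1000)]
compressed = project(projection, gradient)
print(len(gradient), "->", len(compressed))
```

Reusing the same seed across checkpoints and batches keeps every stored term in the same projected space, which is what makes the compressed terms comparable later on.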

### Stage 3: Estimation
- **`estimation.py`**  
  Given a new sample order, this script estimates the updated model parameters using the stored terms and a Taylor expansion-based approximation.
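
The idea behind a Taylor expansion-based estimate can be sketched on a toy problem. The example below is an assumption-laden illustration, not the update rule from the paper or from `estimation.py`: it uses a scalar parameter, plain SGD, and per-batch quadratic losses, and it approximates each gradient by a first-order expansion around the reference checkpoint, `g(theta) ≈ g(theta_ref) + H(theta_ref) * (theta - theta_ref)`. All function names are hypothetical.

```python
def sgd(theta0, order, batches, lr):
    """Plain SGD on toy losses 0.5 * a * (theta - c) ** 2; returns all checkpoints."""
    thetas = [theta0]
    for b in order:
        a, c = batches[b]
        thetas.append(thetas[-1] - lr * a * (thetas[-1] - c))
    return thetas


def precompute(ref_thetas, batches):
    """For each reference checkpoint and each batch, store (gradient, Hessian)."""
    return [[(a * (theta - c), a) for (a, c) in batches]
            for theta in ref_thetas[:-1]]


def estimate(theta0, new_order, ref_thetas, terms, lr):
    """Replay a new order using only stored terms, without touching the losses."""
    theta = theta0
    for t, b in enumerate(new_order):
        grad_ref, hess_ref = terms[t][b]
        # First-order Taylor expansion of the gradient around checkpoint t.
        grad = grad_ref + hess_ref * (theta - ref_thetas[t])
        theta -= lr * grad
    return theta


batches = [(1.0, 2.0), (0.5, -1.0), (2.0, 0.3)]  # (curvature a, optimum c)
ref_thetas = sgd(0.0, [0, 1, 2], batches, lr=0.1)  # Stage 1: reference run
terms = precompute(ref_thetas, batches)            # Stage 2: stored terms
estimated = estimate(0.0, [2, 0, 1], ref_thetas, terms, lr=0.1)  # Stage 3
retrained = sgd(0.0, [2, 0, 1], batches, lr=0.1)[-1]
print(abs(estimated - retrained))
```

Because the toy losses are quadratic, the first-order expansion of the gradient is exact and the estimate matches actual retraining up to floating-point error. For a real LLM the expansion is only an approximation, and its accuracy depends on how far the new trajectory drifts from the reference checkpoints.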

## Other Modules

- **`model.py`**: Contains model architecture and optimizer implementations.
- **`dataset.py`**: Loads and structures training data into batches.
- **`split_data.py`**: Splits datasets into training, validation, and test sets.
- **`preprocess.py`**: Preprocesses the data, for example by filtering.
- **`sample_order.py`**: Generates or loads different sample orders for experimentation.
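
As a rough illustration of what a sample order is, the sketch below treats it as a seeded permutation of batch indices. The function name `make_sample_order` and its parameters are hypothetical and are not the interface of `sample_order.py`.

```python
import random


def make_sample_order(num_batches, seed):
    """Return a reproducible permutation of batch indices (hypothetical helper)."""
    rng = random.Random(seed)
    order = list(range(num_batches))
    rng.shuffle(order)
    return order


# The identity permutation plays the role of the reference order; seeded
# shuffles give alternative orders to estimate against it.
reference_order = list(range(8))
alternative_order = make_sample_order(8, seed=1)
print(reference_order, alternative_order)
```

Fixing the seed makes each alternative order reproducible, so an estimate can be compared against an actual retraining run that uses the same order.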

## Requirements

Install dependencies using:

```bash
pip install -r requirements.txt
```