# Codebase for *Sparser, Faster, Lighter Transformer Language Models*

This is an anonymized repository provided to enable full reproducibility of the experiments reported in our submission.
The codebase contains training, benchmarking, and kernel-level evaluation utilities for sparse transformer language models with gated feed-forward blocks.

After the review process, we plan to release a fully documented and open-sourced version of this repository, including cleaned configuration files, extended documentation, and additional examples.

---

## Installation

We recommend using a clean Python environment. All experiments in this repository were conducted with **Python 3.12**.

To install the required training and benchmarking dependencies, run the provided installation script:

```bash
./scripts/install.sh
```

The script installs all required Python packages as well as auxiliary tooling needed for training, benchmarking, and plotting results.

---

## Training

We provide streamlined launch scripts for training sparse and dense transformer language models using DeepSpeed ZeRO optimization. Training configurations are managed through Hydra.

### Example: Sparse-gated 1.5B model

To train a 1.5B-parameter sparse-gated model with ZeRO Stage 1:

```bash
./launch.sh ${num_gpus} sparsity_gated_1p5b zero1
```

Here:

* `${num_gpus}` specifies the number of GPUs available on the node.
* `sparsity_gated_1p5b` selects the base model configuration.
* `zero1` enables DeepSpeed ZeRO Stage 1.
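
For scripted sweeps, it can help to assemble the launch command programmatically. The snippet below is a minimal sketch, not part of the repository: `build_launch_command` is a hypothetical helper, and it assumes only the positional argument order shown above, followed by optional `key=value` overrides.

```python
import shlex


def build_launch_command(num_gpus, run_cfg, zero_stage, overrides=None):
    """Assemble the argument list for ./launch.sh.

    The three positional arguments mirror the order documented above.
    `overrides` is a hypothetical convenience for trailing key=value pairs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cmd = ["./launch.sh", str(num_gpus), run_cfg, zero_stage]
    # Append each override as a single key=value token.
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_launch_command(8, "sparsity_gated_1p5b", "zero1")
print(shlex.join(cmd))  # shell-quoted string, ready to paste into a terminal
```

The returned list can be passed directly to `subprocess.run`, which avoids shell-quoting issues.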

### Modifying sparsity and model size

Additional model sizes, sparsity configurations, and training variants are defined in:

```
cfgs/run_cfg/
```

For example, to train a 2B-parameter sparse-gated model with a specific L1 regularization coefficient:

```bash
./launch.sh ${num_gpus} sparsity_gated_2b zero1 sparsity_l1_coeff=1e-5
```
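
To give intuition for what `sparsity_l1_coeff` controls, the sketch below shows a generic L1 activation penalty added to a language-modeling loss. It is an illustration under stated assumptions, not the repository's implementation: which activations are penalized, and whether the penalty is a mean or a sum, are assumptions made here, and the function names are hypothetical.

```python
def l1_penalty(activations):
    """Mean absolute value over a batch of hidden activation vectors.

    Assumption: the penalty is averaged over all entries.
    """
    total = sum(abs(v) for row in activations for v in row)
    count = sum(len(row) for row in activations)
    return total / count


def regularized_loss(lm_loss, activations, sparsity_l1_coeff):
    """Language-modeling loss plus a weighted L1 term on the activations."""
    return lm_loss + sparsity_l1_coeff * l1_penalty(activations)


# Toy activations: two tokens, four hidden units each (illustrative values).
hidden = [[0.0, 2.0, -1.0, 0.0], [0.5, 0.0, 0.0, -0.5]]

print(regularized_loss(3.0, hidden, 1e-5))  # penalty slightly raises the loss
print(regularized_loss(3.0, hidden, 0.0))   # coefficient 0 leaves the loss unchanged
```

In this formulation, a larger coefficient pushes more activations toward zero, while a coefficient of zero recovers the unregularized objective.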

### Dense baseline training

To train a dense (non-sparse) baseline with the same codebase, disable sparsity by setting the L1 regularization coefficient to zero:

```bash
./launch.sh ${num_gpus} sparsity_gated_1b zero1 sparsity_l1_coeff=0
```

By default, checkpoints, logs, and intermediate artifacts are saved under the `results/` directory.

---

## Benchmarking TwELL Kernels (Timing and Energy)

We provide a unified benchmarking script that measures per-forward execution time and GPU energy consumption of trained models, for both dense and sparse implementations, including TwELL-based kernels.

Benchmarking is handled via:

```bash
benchmark_llm.py
```

### Required arguments

* `--model-path`: Path to a checkpoint directory or state dictionary file
* `--hydra-run-path`: Hydra run configuration (e.g., a configuration under `cfgs/run_cfg/`, such as `sparsity_gated_1p5b`)
* `--results-dir`: Output directory for benchmark CSVs and plots

### Example: Timing benchmark

To collect efficiency metrics (latency and throughput) for a sparse-gated 1.5B model with and without our kernels:

```bash
python benchmark_llm.py \
  --model-path /path/to/checkpoint/folder \
  --hydra-run-path sparsity_gated_1p5b \
  --results-dir results/benchmarks/exp1 \
  --batch-size 64 \
  --seq-len 2048
```
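
As a reference for interpreting the reported metrics, the sketch below shows the usual relationship between per-forward latency, token throughput, and speedup. It assumes throughput is counted in tokens per second over a full batch. The function names are hypothetical, and the latency values are illustrative placeholders, not measured results.

```python
def tokens_per_second(batch_size, seq_len, latency_s):
    """Tokens processed per second for one forward pass of a full batch."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return batch_size * seq_len / latency_s


def speedup(baseline_latency_s, kernel_latency_s):
    """Ratio of baseline latency to optimized latency (>1 means faster)."""
    if kernel_latency_s <= 0:
        raise ValueError("kernel_latency_s must be positive")
    return baseline_latency_s / kernel_latency_s


# Batch size and sequence length match the command above: 64 * 2048 tokens.
tokens = 64 * 2048
print(tokens)                             # 131072 tokens per forward pass
print(tokens_per_second(64, 2048, 0.5))   # placeholder latency of 0.5 s
print(speedup(0.5, 0.25))                 # placeholder latencies
```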

### Example: Timing + energy measurement

Energy measurements rely on periodic GPU power sampling and can optionally isolate MLP energy usage. To collect both timing and energy data:

```bash
python benchmark_llm.py \
  --model-path /path/to/checkpoint/folder \
  --hydra-run-path sparsity_gated_1p5b \
  --results-dir results/benchmarks/exp1_energy \
  --measure-energy true
```
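
To illustrate how periodic power samples translate into an energy figure, the sketch below integrates power over time with the trapezoidal rule. It is a generic example, not the benchmarking script's implementation. The function names are hypothetical, and how the samples are collected (for example, through NVML) is outside its scope.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (watts) over time (seconds) into joules.

    Uses the trapezoidal rule, so samples need not be evenly spaced.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    if len(timestamps_s) < 2:
        raise ValueError("at least two samples are required")
    energy = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples.
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


def energy_per_forward(total_energy_j, num_forward_passes):
    """Average energy per forward pass over the measured window."""
    if num_forward_passes < 1:
        raise ValueError("num_forward_passes must be at least 1")
    return total_energy_j / num_forward_passes


# Illustrative samples: a constant 100 W draw over 2 seconds.
total = energy_joules([0.0, 1.0, 2.0], [100.0, 100.0, 100.0])
print(total)                          # 200.0 joules
print(energy_per_forward(total, 10))  # 20.0 joules per forward pass
```

A shorter sampling interval reduces the integration error when power fluctuates during the measured window.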

### Outputs

The benchmarking script produces:

* `timing_df.csv`: Per-implementation timing and throughput summary
* `timing_plot.png`: Throughput and speedup visualizations (if `--plot true`)
* `*_seq_len_sweep.csv` and corresponding plots if `--seq-len-step` is specified

All outputs are saved to the specified `--results-dir`.

---

## Extended CUDA Build (Custom Sparse Kernels)

In addition to the Python and Triton-based pipeline, this repository supports an extended CUDA build that enables custom training kernels for sparse operations.

To build and install the CUDA extensions:

```bash
./scripts/training_extension.sh
```

This step is optional for the standard pipeline, but it is required to reproduce the custom sparse training kernel results reported in the paper.

---

## Custom Training Pipeline with Sparse Kernels

Once the CUDA extensions are built, sparse-gated models can be trained using the custom kernel pipeline by selecting the appropriate configuration file.

Example:

```bash
./launch.sh ${num_gpus} training_extensions/sparsity_gated_1p5b.yaml zero1
```

This configuration replaces standard dense kernels with optimized sparse implementations for improved training efficiency and reduced memory usage.

---

## Additional Notes

* Running experiments requires downloading pretrained models and datasets hosted on Hugging Face.
  Please ensure you are logged in with a valid access token:

  ```bash
  huggingface-cli login
  ```

* Logging and experiment tracking may optionally use Weights & Biases.
  To disable W&B logging, modify the relevant Hydra configuration files and set:

  ```yaml
  report_to: null
  ```

* While multi-node distributed training is supported, most experiments can be reproduced on a single multi-GPU node, provided the GPUs have sufficient memory.
