# HiRouter

## Introduction

This repository contains the code for our paper "Efficient Maximum Inner Product Search for Top-K Attention via Hierarchical Routers".

This code was run on a single cluster node with 8 H100 GPUs (HBM3 memory), using CUDA 12.8.

## Setting up environments

In general, we recommend installing the dependencies listed in `requirements.txt` by running
```bash
pip install -r requirements.txt
```
as this ensures that the versions of the HuggingFace packages match those used for the models.

Occasionally, installation may fail because `torch` is not yet installed, as it is a dependency of `flash-attn`. In this case, we recommend running
```bash
pip install torch torchvision torchaudio
```
and then running the first command again. If issues persist with symbols not being recognized, you may need to install `flash-attn` from a wheel directly.

**Note**: In our environment, we ran with `torch==2.6.0` and `flash-attn==2.6.3`.
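If symbol errors persist, it can help to confirm which versions are actually installed. The following is a minimal sketch, not part of the repository: the helper name `check_versions` is hypothetical, and it assumes the packages are registered under the distribution names `torch` and `flash-attn`. It only compares installed versions against the ones listed in the note above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported in the note above; adjust if your setup differs.
EXPECTED = {"torch": "2.6.0", "flash-attn": "2.6.3"}


def check_versions(expected=EXPECTED):
    """Map each package name to (installed version or None, expected version, matches)."""
    report = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None  # package is not installed in this environment
        # Ignore local build tags such as "+cu124" when comparing.
        matches = have is not None and have.split("+")[0] == want
        report[name] = (have, want, matches)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={have} expected={want} [{status}]")
```

A mismatch does not necessarily mean the code will fail, only that the environment differs from the one we tested.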

## Training

We provide all our training scripts in the `routing-tree-attn` subfolder. This folder contains scripts and files for training models using the HiRouter structure.

**Note**: You may need to modify the bash scripts (in the underlying `scripts/` folder) to load model configurations and other files properly. This is because we usually store models directly in a `models/` folder.

## Evaluation

We run our evaluations on existing benchmarks. To simplify managing environments and package versions, we create a separate environment for each benchmark we run. Each folder has its own `requirements.txt` file, which should be sufficient to determine which packages are necessary to run that benchmark.

### `lm-eval-harness`

For harness evaluation, follow the general `lm-eval-harness` instructions and use the `lm_eval` command as usual. To evaluate a HiRouter model, pass the `--routing` argument. For this benchmark, we use the dependencies from the main folder and then run
```bash
pip install -e .
```
within the `lm-eval-harness` folder.

### LongBench

Similarly, for LongBench, install the dependencies from the subfolder and follow their instructions for running. The specific file can be found at `LongBench/LongBench/requirements.txt`.

Our results were produced with the `LongBench/LongBench/pred_fixed.py` file, which is a patched version of `LongBench/LongBench/pred.py` that supports HiRouter. Again, run with `--routing` to indicate that the model is a HiRouter model.

Additionally, LongBench was run on a single GPU, so we do not make use of the `world_size` and `rank` arguments.

### S-NIAH

To run S-NIAH, we use the [`RULER`](https://github.com/NVIDIA/RULER) repository. Note that since this repository does not pin package versions, installing the required packages may be more difficult. For this reason, we provide a requirements file in `RULER/docker` with all the specific package versions we used. Additionally, we provide our modified/patched script for running these tasks in the `RULER` subdirectory.

**Note:** When running Mamba models from `mamba-ssm`, there may be a generation error if you do not set the generation temperature to a value greater than 0.0.
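One way to avoid accidentally passing a zero temperature is to clamp it before calling generation. The sketch below is illustrative only: the helper `generation_kwargs`, the `min_temperature` floor of `1e-5`, and the `max_length` key in the usage line are assumptions, not part of this repository or of `mamba-ssm`.

```python
def generation_kwargs(temperature=0.0, min_temperature=1e-5, **extra):
    """Build keyword arguments for a generation call with a strictly positive temperature.

    Hypothetical helper: the returned dict is meant to be unpacked into
    whatever generation function you are calling.
    """
    if min_temperature <= 0.0:
        raise ValueError("min_temperature must be greater than 0.0")
    # Clamp zero or negative temperatures up to the small positive floor.
    safe_temperature = max(float(temperature), min_temperature)
    return {"temperature": safe_temperature, **extra}


if __name__ == "__main__":
    # A requested temperature of 0.0 is replaced by the floor value.
    print(generation_kwargs(temperature=0.0, max_length=64))
```

A very small positive temperature behaves almost like greedy decoding, so this should leave results close to what a temperature of 0.0 would be expected to produce.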

## Other baselines

For other baselines, you may need to install the `flash-linear-attention` library and include the following line at the top of any file that runs generation or training:
```python
import fla
```
This ensures that the `fla` models are supported through the `transformers` interface. For more information, please refer to their [official repository](https://github.com/fla-org/flash-linear-attention).

For training baseline models, we used [`flame`](https://github.com/fla-org/flame) from the same authors. We modified the configuration files to use the `EleutherAI/gpt-neox-20b` tokenizer.

## Citation

Coming Soon.
