# T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning

## Requirements

To install the required packages for Python 3.12:

```bash
pip install -r requirements.txt
```
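
Since the requirements target Python 3.12, it can help to confirm which interpreter is active before installing. The following is a minimal sketch, not part of the repository's scripts; the variable name `PY_VERSION` is an illustrative assumption.

```bash
# Illustrative check (not part of this repository): report the active Python version.
# PY_VERSION is a hypothetical variable name used only in this sketch.
if command -v python3 >/dev/null 2>&1; then
  PY_VERSION="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
else
  PY_VERSION="missing"
fi

# Warn (without aborting) when the interpreter is not the 3.12 series.
if [ "$PY_VERSION" = "3.12" ]; then
  echo "Python 3.12 detected."
else
  echo "Warning: expected Python 3.12, found ${PY_VERSION}." >&2
fi
```

Run this check before the `pip install` command above; a warning does not necessarily mean the installation will fail.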

## Data Selection

To select data, run the following command:

```bash
bash scripts/select_data.sh
```

The selected data is provided in `datasets/alpaca_gpt4`.

## Training

To train the models described in the paper, run one of the following commands:

```bash
bash scripts/train_qwen25_7b.sh tshirt_k_50 datasets/alpaca_gpt4/tshirt_k_50.json
```

or

```bash
bash scripts/train_llama31_8b.sh tshirt_k_75 datasets/alpaca_gpt4/tshirt_k_75.json
```

## Evaluation

### OpenLLM Leaderboard Benchmarks

We evaluate instruction-tuned models on six OpenLLM Leaderboard benchmarks using LM-Eval-Harness. For setup and usage instructions, refer to the [official LM-Eval-Harness repository](https://github.com/EleutherAI/lm-evaluation-harness).

Specifically, we use the following benchmarks:

* [ARC-Challenge](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/arc)
* [HellaSwag](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/hellaswag)
* [MMLU](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu)
* [TruthfulQA](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/truthfulqa)
* [BBH](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard/bbh_mc)
* [GSM8K](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k)

The corresponding LM-Eval-Harness task names are:

```
arc_challenge, hellaswag, mmlu, truthfulqa_mc2, leaderboard_bbh, gsm8k
```
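
As a rough sketch of how these task names could be passed to the harness, the command below uses the `lm_eval` command-line entry point with its Hugging Face backend. The checkpoint path `outputs/tshirt_k_50`, the output directory `eval_results`, and the `DRY_RUN` switch are hypothetical placeholders rather than names from this repository. This README does not state few-shot or batch settings, so the sketch relies on the harness defaults, which may not match the configuration used in the paper.

```bash
# Hypothetical placeholders: point MODEL_PATH at your fine-tuned checkpoint.
MODEL_PATH="${MODEL_PATH:-outputs/tshirt_k_50}"
OUTPUT_DIR="${OUTPUT_DIR:-eval_results}"

# Task names exactly as listed above.
TASKS="arc_challenge,hellaswag,mmlu,truthfulqa_mc2,leaderboard_bbh,gsm8k"

# DRY_RUN=1 (the default in this sketch) prints the command instead of running it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  RUNNER="echo"
else
  RUNNER=""
fi

# Few-shot settings are left at the harness defaults (an assumption of this sketch).
$RUNNER lm_eval \
  --model hf \
  --model_args "pretrained=${MODEL_PATH}" \
  --tasks "${TASKS}" \
  --batch_size auto \
  --output_path "${OUTPUT_DIR}"
```

Set `DRY_RUN=0` to launch the evaluation once LM-Eval-Harness is installed and `MODEL_PATH` points to a real checkpoint.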

### Arena-Hard

We use `Gemini-2.5-Flash-Preview-04-17` as the judge for Arena-Hard-v0.1. Please refer to the [official Arena-Hard repository](https://github.com/lmarena/arena-hard-auto) for evaluation details.

### AlpacaEval-2.0

We use `GPT-4o-2024-08-06` as the judge for AlpacaEval-2.0. Please refer to the [official AlpacaEval-2.0 repository](https://github.com/tatsu-lab/alpaca_eval) for evaluation details.