# HOBIT (Anonymous ICML Submission)

This repository is an anonymized code release accompanying an ICML submission.
It contains a dense retrieval training and evaluation pipeline built around the **HOBIT** batch sampler (hardness-optimized batching for in-batch training).

No author, institution, or identifying information is included.

## Reproducing Paper-Scale Runs

Experiment configs are in `experiments/configs/` and are named `*-hobit-*.yaml`.
They assume you have already prepared TSV files for the datasets.

### Data format

The pipeline expects TSV files:

- `collection.tsv`: `doc_id<TAB>document_text`
- `queries.*.tsv`: `query_id<TAB>query_text`
- `qrels.*.tsv` (TREC qrels format): `query_id<TAB>0<TAB>doc_id<TAB>relevance`
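
Before launching a long run, it can help to check that your files match the layouts above. The following is a minimal sketch and not part of this repository: `read_pairs` and `read_qrels` are hypothetical helper names, and the strict column-count checks are an assumption about how you want malformed lines handled.

```python
def read_pairs(path):
    """Read a two-column TSV (id<TAB>text) into a dict mapping id -> text."""
    pairs = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            fields = line.split("\t")
            if len(fields) != 2:
                raise ValueError(
                    f"{path}:{line_no}: expected 2 tab-separated fields, got {len(fields)}"
                )
            pairs[fields[0]] = fields[1]
    return pairs


def read_qrels(path):
    """Read qrels (query_id<TAB>0<TAB>doc_id<TAB>relevance) into a nested dict."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            fields = line.split("\t")
            if len(fields) != 4:
                raise ValueError(
                    f"{path}:{line_no}: expected 4 tab-separated fields, got {len(fields)}"
                )
            query_id, _unused, doc_id, relevance = fields
            qrels.setdefault(query_id, {})[doc_id] = int(relevance)
    return qrels
```

Calling `read_pairs` on your `collection.tsv` and `queries.*.tsv` files and `read_qrels` on your `qrels.*.tsv` files should complete without raising if the files follow the expected format.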

### Data locations

The provided configs use relative paths under `./data/` (e.g., `./data/MSMARCO/...`).
If you store data elsewhere, either edit the YAML paths or symlink your dataset directory to match.
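
The symlink option can be sketched as follows. This is an illustrative helper rather than code shipped in the repository: `link_dataset` is a hypothetical name, and refusing to overwrite an existing path is an assumed safety choice.

```python
from pathlib import Path


def link_dataset(source_dir, link_path):
    """Create a symlink at link_path that points to an existing dataset directory."""
    source = Path(source_dir).resolve()
    link = Path(link_path)
    if not source.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {source}")
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"refusing to overwrite existing path: {link}")
    # Create the parent (e.g. ./data/) if it does not exist yet.
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(source, target_is_directory=True)
    return link
```

For example, `link_dataset("/path/to/your/msmarco", "./data/MSMARCO")` (where the first argument is a placeholder for your own storage location) makes the dataset visible under the path the configs expect.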

### Run training

Single GPU:

```bash
python src/train.py --config experiments/configs/<YOUR_CONFIG>.yaml
```

Multi-GPU (single node; replace `NUM_GPUS` with the number of GPUs to use):

```bash
torchrun --nproc_per_node=NUM_GPUS src/train.py --config experiments/configs/<YOUR_CONFIG>.yaml
```

## Repo Pointers

- `src/train.py`: training entry point
- `src/evaluate.py`: evaluation entry point
- `src/samplers/batch_sampler/hobit.py`: HOBIT batch sampler
