# Overview

Basic datasets including MNIST, CIFAR, and Speech Commands will auto-download. Source code for these datamodules are in [basic.py](basic.py).

By default, data is downloaded to `./data/`  by default, where `.` is the top level directory of this repository.

- [Data Preparation](#data-preparation) - Instructions for downloading other datasets
- [Adding a Dataset](#adding-a-dataset-wip) - Basic instructions for adding new datasets

## Advanced Usage

After downloading and preparing data, the paths can be configured in several ways.

1. Suppose that it is desired to download all data to a different folder, for example a different disk.
The data path can be configured by setting the environment variable `DATA_PATH`, which defaults to `./data`.

2. For fine-grained control over the path of a particular dataset, set `dataset.data_dir` in the config. For example, if the LRA ListOps files are located in `/home/lra/listops-1000/` instead of the default `./data/listops/`,
pass in `+dataset.data_dir=/home/lra/listops-1000` on the command line or modify the config file directly.

3. As a simple workaround, softlinks can be set, e.g. `ln -s /home/lra/listops-1000 ./data/listops`


# Data Preparation

Datasets that must be manually downloaded include [LRA]

By default, these should go under `$DATA_PATH/`, which defaults to `./data`.  For the remainder of this README, these are used interchangeably.

## Long Range Arena (LRA)

LRA can be downloaded from the [GitHub page](https://github.com/google-research/long-range-arena).
These datasets should be organized as follows:
```
$DATA_PATH/
  pathfinder/
    pathfinder32/
    pathfinder64/
    pathfinder128/
    pathfinder256/
  aan/
  listops/
```
The other two datasets in the suite ("Image" i.e. grayscale sequential CIFAR-10; "Text" i.e. char-level IMDB sentiment classification) are both auto-downloaded.

The following sequence of commands prepares the LRA datasets in the default data path:
```
cd data
wget https://storage.googleapis.com/long-range-arena/lra_release.gz
tar xvf lra_release.gz
mv lra_release/lra_release/listops-1000 listops
mv lra_release/lra_release/tsv_data aan
mkdir pathfinder
mv lra_release/lra_release/pathfinder* pathfinder/
rm -r lra_release
```