# Shakespeare Dataset

## Setup Instructions
- Run preprocess.sh with a choice of the following tags:

  - ```-s``` := 'iid' to sample in an i.i.d. manner, or 'niid' to sample in a non-i.i.d. manner; more information on i.i.d. versus non-i.i.d. is included in the 'Notes' section
  - ```--iu``` := number of users when sampling i.i.d., expressed as a fraction of the total number of users; default is 0.01
  - ```--sf``` := fraction of data to sample, written as a decimal; default is 0.1
  - ```-k``` := minimum number of samples per user; users with fewer samples are removed
  - ```-t``` := 'user' to partition users into train-test groups, or 'sample' to partition each user's samples into train-test groups; default is 'sample'
  - ```--tf``` := fraction of data in training set, written as a decimal; default is 0.9
  - ```--raw``` := include users' raw text data in all_data.json
  - ```--smplseed``` := seed to be used before random sampling of data
  - ```--spltseed``` := seed to be used before random split of data

e.g.
- ```./preprocess.sh -s niid --sf 1.0 -k 0 -t sample --tf 0.8``` (full-sized dataset)<br/>
- ```./preprocess.sh -s niid --sf 0.2 -k 0 -t sample --tf 0.8``` (small-sized dataset)

(```--tf 0.8``` reflects the train-test split used in the [FedAvg paper](https://arxiv.org/pdf/1602.05629.pdf))

Make sure to delete the ```rem_user_data```, ```sampled_data```, ```test```, and ```train``` subfolders in the ```data``` directory before re-running ```preprocess.sh```.
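
The cleanup step above can be sketched as a small shell function. This is only an illustration: the function name ```clean_preprocess_outputs``` is hypothetical, and it assumes the four subfolders sit directly under ```data``` relative to the directory containing ```preprocess.sh```. The ```all_data``` folder is left in place.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): delete the outputs of
# earlier sampling, user-removal, and split steps. all_data is not touched.
clean_preprocess_outputs() {
  local data_dir="${1:-data}"   # assumed location of the data directory
  local sub
  for sub in rem_user_data sampled_data test train; do
    # -r: recurse into the folder; -f: ignore folders that do not exist.
    # ${data_dir:?} aborts if the path is empty, so "/" is never targeted.
    rm -rf "${data_dir:?}/${sub}"
  done
}

clean_preprocess_outputs data
```

After this, ```preprocess.sh``` can be re-run with new tags.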

## Notes
- More details on i.i.d. versus non-i.i.d.:
  - In the i.i.d. sampling scenario, each datapoint is equally likely to be sampled. Thus, all users have the same underlying distribution of data.
  - In the non-i.i.d. sampling scenario, the underlying distribution of data for each user is consistent with the raw data. Since we assume that data distributions vary between users in the raw data, we refer to this sampling process as non-i.i.d.
- More details on ```preprocess.sh```:
  - The order in which ```preprocess.sh``` processes data is 1. generating all_data, 2. sampling, 3. removing users, and 4. creating train-test split. The script will look at the data in the last generated directory and continue preprocessing from that point. For example, if the ```all_data``` directory has already been generated and the user decides to skip sampling and only remove users with the ```-k``` tag (i.e. running ```./preprocess.sh -k 50```), the script will apply the user-removal filter to the data in ```all_data``` and place the resulting data in the ```rem_user_data``` directory.
  - File names provide information about the preprocessing steps taken to generate them. For example, the ```all_data_niid_1_keep_64.json``` file was generated by first sampling 10 percent (.1) of the data in ```all_data.json``` in a non-i.i.d. manner and then applying the ```-k 64``` argument to the resulting data.
- Each .json file is an object with 4 keys:
  1. 'users', a list of users
  2. 'hierarchies', a list of strings, with each string representing the group that the respective user belongs in; not present in i.i.d. data
  3. 'num_samples', a list of the number of samples for each user, and 
  4. 'user_data', an object with user names as keys; the values are represented as objects with keys 'x', 'y', and 'raw'. 'x' and 'y' refer to strings and their corresponding next character. 'raw' refers to the text data from which the model data was extracted; this key appears only in all_data.json, and only when the '--raw' tag is used.
- Run ```./stats.sh``` to get statistics of data (data/all_data/all_data.json must have been generated already)
- To run the reference implementations in the ```../models``` directory, the ```-t sample``` tag must be used when running ```./preprocess.sh```
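
The .json layout described above can be checked with a short helper. This is a minimal sketch, assuming ```python3``` is available on the PATH; the name ```summarize_json``` is hypothetical and not part of the repository. It reads only the 'users', 'num_samples', and 'user_data' keys listed in the notes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print a short summary
# of one preprocessed .json file. Assumes python3 is installed.
summarize_json() {
  # "python3 -" reads the program from stdin; "$1" becomes sys.argv[1].
  python3 - "$1" <<'PY'
import json
import sys

with open(sys.argv[1]) as f:
    data = json.load(f)

print("users:", len(data["users"]))
print("total samples:", sum(data["num_samples"]))

# Show which keys ('x', 'y', and possibly 'raw') the first user carries.
first = data["users"][0]
print("keys for first user:", sorted(data["user_data"][first].keys()))
PY
}

# Usage, once the file has been generated:
#   summarize_json data/all_data/all_data.json
```

For a file produced without the ```--raw``` tag, the last line should list only 'x' and 'y'.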
