# Dataset generation

#### Setup

- Symlink your habitat datafolder to this subfolder, e.g. `ln -s /path/to/habitat-data data`
- Activate your conda environment with habitat (0.2.3) installed

#### Generate dataset

Run the following command:

```
python -m dataset_generation.benchmark_generation.generate_instructions generator.llm.generation_params.engine=/path-to/Llama-2-70b-chat-hf generator.calls_per_scene=5
```

This generates a set of JSON files, each containing a natural language instruction and an initialization. You then need to parse and clean these files. For this, run:

```
python -m dataset_generation.benchmark_generation.parse_generated_instructions
```

The last step is to read these files, filter out invalid samples, and create a json.gzip episode dataset file. For this, run from the command line:

```
python -m dataset_generation.benchmark_generation.generate_episodes --gen-config <path_to_generator_config.json> --metadata-dict <path_to_metadata_config.json> --init-state-dicts <path_to_generated_inits.json>
```
where
- `gen-config` is the **optional** JSON config containing asset paths and output directories. A default is used if not provided.
- `metadata-dict` is the **optional** JSON config containing metadata (semantic .csv) paths. A default is used if not provided.
- `init-state-dicts` is the JSON config containing your parsed generated instructions. It should have a single parent key `"initial_state_dicts"` mapped to a list of per-episode initial state config dicts.

See the main function of `generate_episodes.py` for defaults and examples of the above configs. Also see `tests/test_episode_generator.py` for an example use of the scripting API.
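
As a rough illustration of the `init-state-dicts` layout described above, the sketch below writes a small file and checks that it has the single parent key `"initial_state_dicts"` mapped to a list of dicts. The helper `load_init_state_dicts` and the `placeholder_field` entries are hypothetical and not part of this repository; the real per-episode fields are defined by `generate_episodes.py`.

```python
import json
import tempfile
from pathlib import Path


def load_init_state_dicts(path):
    """Load a parsed-instructions JSON file and check its top-level layout.

    Hypothetical helper for illustration only; not part of the repository.
    """
    with open(path) as f:
        data = json.load(f)
    # The file should hold exactly one parent key, "initial_state_dicts".
    if not isinstance(data, dict) or set(data) != {"initial_state_dicts"}:
        raise ValueError('expected a single top-level key "initial_state_dicts"')
    episodes = data["initial_state_dicts"]
    # That key should map to a list of per-episode dicts.
    if not isinstance(episodes, list) or not all(isinstance(e, dict) for e in episodes):
        raise ValueError('"initial_state_dicts" must map to a list of dicts')
    return episodes


# Placeholder entries: the actual per-episode fields are not shown here.
example = {
    "initial_state_dicts": [
        {"placeholder_field": "episode 0"},
        {"placeholder_field": "episode 1"},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "generated_inits.json"
    path.write_text(json.dumps(example))
    episodes = load_init_state_dicts(path)

print(f"{len(episodes)} per-episode initial state dicts found")
```

A check like this can catch a malformed file before it is passed to `--init-state-dicts`.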

Alternatively, you can run parsing, filtering, and episode generation all in one go with the pipeline script `llmgen2episodes_pipeline.py`. It accepts two additional CLI arguments:
- `--addclutter` defines whether generation is done with the addition of clutter.
- `--genpercalls` defines the default number of generations done per LLM call during benchmark generation. Use the same value as during benchmark generation: it is used to calculate the expected number of generations and compare it with the number obtained after filtering hallucinations etc.

This script will be deprecated as we move forward with the automated pipeline (.sh script). Run it as follows:

```
python -m dataset_generation.benchmark_generation.llmgen2episodes_pipeline --rootfolder <path_to_folder_containing_llm_generated_jsons>
```
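
To make the role of `--genpercalls` concrete, here is a minimal sketch of the expected-versus-obtained comparison. It assumes the expected count is simply the number of LLM calls times the generations per call; the function name and the numbers are made up for illustration and do not come from the pipeline script.

```python
def generation_yield(num_llm_calls, gen_per_call, num_valid_episodes):
    """Compare the expected number of generations with those kept after filtering.

    Assumption for illustration: expected = LLM calls * generations per call.
    """
    expected = num_llm_calls * gen_per_call
    # Fraction of expected generations that survived filtering.
    kept = num_valid_episodes / expected if expected else 0.0
    return expected, kept


# Made-up numbers: 5 LLM calls, 10 generations per call, 32 valid episodes.
expected, kept = generation_yield(num_llm_calls=5, gen_per_call=10, num_valid_episodes=32)
print(f"expected={expected}, kept={kept:.0%}")
```

If `--genpercalls` differs from the value used during benchmark generation, the expected count, and therefore this ratio, would be off.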

We now also have a way to run instruction + episode generation via sbatch with a desired episode target. Before running, change the conda environment name for hab-llm in `run_e2epipeline.sh` to your own. Then set `target_episodes` in the script below, along with the desired prompt file, whether you want clutter or not, the log file name, and the generation output path. The script runs LLM instruction generation and episode generation iteratively until the target number of episodes is reached:
```
./dataset_generation/benchmark_generation/run_iterative_gen.sh
```
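
The iterate-until-target idea can be sketched as follows. This is not the contents of `run_iterative_gen.sh`; `run_until_target`, `fake_round`, and the `max_rounds` safety cap are hypothetical names introduced here, with `fake_round` standing in for one round of instruction generation plus episode generation.

```python
def run_until_target(target_episodes, generate_round, max_rounds=100):
    """Repeat generation rounds until enough valid episodes have accumulated.

    generate_round is any callable returning the number of valid episodes
    produced by one round. max_rounds is a hypothetical safety cap so the
    loop ends even if rounds keep yielding nothing.
    """
    total = 0
    rounds = 0
    while total < target_episodes and rounds < max_rounds:
        total += generate_round()
        rounds += 1
    return total, rounds


def fake_round():
    # Stand-in: pretend every round yields 3 valid episodes after filtering.
    return 3


total, rounds = run_until_target(target_episodes=20, generate_round=fake_round)
print(f"collected {total} episodes in {rounds} rounds")
```

Because filtering discards some samples each round, the number of rounds needed is not known in advance, which is why the target is expressed in episodes rather than in LLM calls.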
