# Synthetic Data Generation

This folder contains the code for generating a dataset of step-by-step integrations using SymPy. To generate a dataset, follow these steps:

1. Follow the instructions in `dataset/README.md` to download the datasets from the paper [Deep Learning for Symbolic Mathematics](https://arxiv.org/abs/1912.01412). There are three datasets: `prim_fwd`, `prim_bwd`, and `prim_ibp`.
2. To generate a step-by-step integration dataset from, for example, 1000000 samples of the `prim_ibp` dataset, run the following command:

```bash
python -m alpha_integrate.synthetic_data.decompose --num_expressions 1000000 --dataset prim_ibp
```

You can run this script again, and it will add more data to the `steps_dataset/prim_ibp` folder without duplicating existing data. To prevent duplicates, the script saves the ids of the processed expressions into `steps_dataset/prim_ibp/ids/ids.pkl`. The logs of the script are in the `decomposelogs` folder.
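
The duplicate check described above can be pictured with a short Python sketch. It is not the code in `decompose.py`. It only assumes that `ids.pkl` stores a pickled dictionary whose keys identify already-processed lines (as described for `steps_dataset` below), and it uses line indices as keys, which is an assumption. The function names `load_processed_ids`, `sample_unprocessed`, and `save_processed_ids` are hypothetical.

```python
import pickle
import random
from pathlib import Path


def load_processed_ids(ids_path):
    """Return the dictionary of processed ids, or an empty one on the first run."""
    path = Path(ids_path)
    if not path.exists():
        return {}
    with path.open("rb") as f:
        return pickle.load(f)


def sample_unprocessed(num_lines, processed, num_expressions, seed=None):
    """Sample up to num_expressions line indices that are not yet in processed."""
    # Assumption: expressions are identified by their line index in the .train file.
    candidates = [i for i in range(num_lines) if i not in processed]
    rng = random.Random(seed)
    return rng.sample(candidates, min(num_expressions, len(candidates)))


def save_processed_ids(ids_path, processed, new_ids):
    """Mark new_ids as processed and write the dictionary back to disk."""
    for i in new_ids:
        processed[i] = True
    path = Path(ids_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(processed, f)
```

Because every run filters its candidates against the saved dictionary, a second run can only pick expressions that the first run did not touch.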

3. To shuffle the steps of the generated dataset, run the following command:

```bash
python -m alpha_integrate.synthetic_data.shuffle --dataset prim_ibp
```

This script shuffles the steps of the expressions in the `steps_dataset/prim_ibp` folder and saves the shuffled dataset into `final_steps_dataset/prim_ibp`. The logs of the script are in the `shufflelogs` folder.
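
As a rough illustration of the shuffling step, the sketch below shuffles a list of step lines and splits it into train, validation, and test parts. It is not the code in `shuffle.py`: the function name `shuffle_and_split`, the one-step-per-line layout, and the split fractions are all assumptions, since this README does not state the ratios the script uses.

```python
import random


def shuffle_and_split(steps, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle step lines and split them into train, val, and test lists.

    The fractions are placeholder values, not the ones used by shuffle.py.
    """
    shuffled = list(steps)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```

Fixing the seed makes the split reproducible, which helps when comparing runs on the same generated data.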

4. To generate a test dataset from the validation or test dataset of the paper, run the following command:

```bash
python -m alpha_integrate.synthetic_data.create_testset --save_path "alpha_integrate/synthetic_data/final_steps_dataset/val/prim_fwd_val.txt" --data_path "alpha_integrate/synthetic_data/dataset/prim_fwd/prim_fwd.valid"
```

The above command generates a validation dataset from `prim_fwd.valid` and saves it to `val/prim_fwd_val.txt`. To generate a test dataset instead, change `.valid` to `.test` in `--data_path`. We suggest setting the folder in `--save_path` to `val` or `test` accordingly to avoid confusion. To do the same for `prim_bwd` or `prim_ibp`, replace `prim_fwd` everywhere in the command. If you get an error that the folder does not exist, create the folder manually.
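
If you prefer to create the missing folder from Python rather than by hand, a minimal sketch using only the standard library is shown below. The helper name `ensure_parent_dir` is hypothetical and is not part of this repository.

```python
from pathlib import Path


def ensure_parent_dir(save_path):
    """Create the folder that will hold save_path if it does not exist yet."""
    parent = Path(save_path).parent
    # parents=True also creates intermediate folders; exist_ok=True makes the call repeatable.
    parent.mkdir(parents=True, exist_ok=True)
    return parent
```

For example, calling `ensure_parent_dir("alpha_integrate/synthetic_data/final_steps_dataset/val/prim_fwd_val.txt")` before running the command creates the `val` folder if it is missing.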

5. To generate a test dataset from the test split (`test.txt`) of the training dataset, run the following command:

```bash
python -m alpha_integrate.synthetic_data.create_traintestset --data_path alpha_integrate/synthetic_data/final_steps_dataset/prim_ibp/test.txt --save_path alpha_integrate/synthetic_data/final_steps_dataset/traintest/prim_ibp_traintest.txt
```

Change the dataset name in both paths to do the same for the other datasets.

6. To mix the datasets, you can run the following command:

```bash
python -m alpha_integrate.synthetic_data.shufflemix --dataset prim_fwd prim_ibp prim_bwd
```

This will mix the datasets and save them into `final_steps_dataset/{dataset names concatenated by +}`.
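
The naming and mixing can be sketched as follows. This is an illustration and not the code in `shufflemix.py`: the helper names are hypothetical, and it assumes the names are joined in the order given on the command line and that mixing amounts to concatenating the lines of each dataset and shuffling them together.

```python
import random


def mixed_dataset_name(datasets):
    """Join dataset names with '+' to form the output folder name."""
    return "+".join(datasets)


def mix_lines(lines_per_dataset, seed=0):
    """Concatenate the lines of several datasets and shuffle them together."""
    mixed = [line for lines in lines_per_dataset for line in lines]
    random.Random(seed).shuffle(mixed)
    return mixed
```

Under these assumptions, `mixed_dataset_name(["prim_fwd", "prim_ibp", "prim_bwd"])` returns `prim_fwd+prim_ibp+prim_bwd`.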

---

# Description of important folders and files

Folders with data:

- **`dataset`**
    - Contains the datasets from the paper [Deep Learning for Symbolic Mathematics](https://arxiv.org/abs/1912.01412). You need the `prim_bwd`, `prim_fwd`, and `prim_ibp` folders, each containing the files `{dataset_name}.train`, `{dataset_name}.valid`, and `{dataset_name}.test`.

- **`steps_dataset`**
    - Contains three subfolders, one per dataset. Each holds files named `{dataset_name}_{i}.txt` with the step-by-step integration of expressions from the dataset `{dataset_name}`, plus a folder `ids`. The `ids` folder contains `ids.pkl`, a pickled dictionary whose keys are the lines of `dataset/{dataset_name}/{dataset_name}.train` that have already been processed; it is used to prevent duplicates. See the explanation of `decompose.py` below for how these files are generated.

- **`final_steps_dataset`**
    - Contains the final dataset used for training the model. It has the same structure as the `steps_dataset` folder, but each subfolder contains `train.txt`, `val.txt`, and `test.txt` files holding the final dataset. See the explanation of `shuffle.py` below for how these folders are generated from the `steps_dataset` folder. In addition, this folder contains `test` and `val` folders that hold the test and validation sets for each dataset, in files named `{dataset_name}_test.txt` or `{dataset_name}_val.txt`, where each line has just a test expression and its result.

Code files:

- **`params`** 
    - Contains tokenization parameters for both the datasets from the paper and our own representations. It also contains the tokenization of the action space on symbolic expressions.

- **`create_testset.py`** 
    - Reads expressions from `dataset/{dataset_name}/{dataset_name}.valid` (or `.test`) and saves them into `final_steps_dataset/val/{dataset_name}_val.txt` (or `final_steps_dataset/test/{dataset_name}_test.txt`), where every line contains an expression and its result separated by `\t`. For example, you can run:

    ```bash
    python -m alpha_integrate.synthetic_data.create_testset --save_path "alpha_integrate/synthetic_data/final_steps_dataset/val/prim_fwd_val.txt" --data_path "alpha_integrate/synthetic_data/dataset/prim_fwd/prim_fwd.valid"
    ```

    to generate a validation dataset from `prim_fwd.valid` into `val/prim_fwd_val.txt`. Similarly, you can run:

    ```bash
    python -m alpha_integrate.synthetic_data.create_testset --save_path "alpha_integrate/synthetic_data/final_steps_dataset/test/prim_ibp_test.txt" --data_path "alpha_integrate/synthetic_data/dataset/prim_ibp/prim_ibp.test"
    ```

    to generate a test dataset from `prim_ibp.test` into `test/prim_ibp_test.txt`.

- **`decompose_steps.py`** 
    - Implements the function `decompose_steps`, which takes an expression and the integration symbol as parameters. If it can find the steps, it returns the step-by-step integration of the expression as a tokenized list; otherwise it returns `[None]`.

- **`decompose.py`**
    - The command below randomly samples 500000 expressions from `synthetic_data/dataset/prim_ibp/prim_ibp.train`, among those not already recorded as processed in `steps_dataset/prim_ibp/ids/ids.pkl`, and saves their step-by-step integration into files named `steps_dataset/prim_ibp/prim_ibp_{i}.txt`. There are multiple files because the script splits the data across parallel processes to speed things up, and each process writes to its own file. The script also saves the ids of the processed expressions into `steps_dataset/prim_ibp/ids/ids.pkl` to prevent duplicates.

    ```bash
    python -m alpha_integrate.synthetic_data.decompose --num_expressions 500000 --dataset prim_ibp
    ```

    The logs of the script are in the `decomposelogs` folder.

- **`shuffle.py`**
    - Takes the data saved in multiple files in `steps_dataset/{dataset_name}` and creates `train.txt`, `val.txt`, and `test.txt` in `final_steps_dataset/{dataset_name}` by shuffling steps from different expressions and shuffling the steps themselves. You can run it as follows:

    ```bash
    python -m alpha_integrate.synthetic_data.shuffle --dataset prim_bwd
    ```

    The logs of the script can be found in the `shufflelogs` folder.

- **`method.py`**
    - Contains the code for all rules that can be applied to symbolic expressions; each rule derives from the `Method` class.

- **`step_stats.py`**
    - Contains the code for calculating the statistics of the step-by-step integration dataset. For example, you can run it for `prim_ibp` as follows:

    ```bash
    python -m alpha_integrate.synthetic_data.step_stats --dataset prim_ibp
    ```

