# SIRD: Transformers Assisted Step by Step Symbolic Integration

The dataset can be downloaded from [here](http://tiny.cc/sird27m).
- The dataset is published in CSV format.
- The file `expr_27465168.csv` contains more than 27 million samples, each a function-integration rule pair.
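
Because the file holds more than 27 million rows, it can help to inspect a few rows before loading everything. The following is a minimal sketch using only the standard library; `peek_rows` is a hypothetical helper, not part of this repository, and it makes no assumption about column names or whether the file has a header row.

```python
import csv
from itertools import islice


def peek_rows(csv_path, n=5):
    """Return the first n rows of a CSV file without reading the rest.

    Hypothetical helper for illustration; not part of this repository.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # csv.reader is lazy, so islice stops after n rows
        return list(islice(csv.reader(handle), n))


# Example usage (assumes the file has been downloaded to the current directory):
# for row in peek_rows("expr_27465168.csv"):
#     print(row)
```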

## Instructions to run Code
### Setup
1. Create a new conda environment and install the `requirements.txt` packages using pip.
2. Download the datasets provided by [Lample & Charton 2019](https://arxiv.org/abs/1912.01412):
- FWD dataset archive: `wget https://dl.fbaipublicfiles.com/SymbolicMathematics/data/prim_fwd.tar.gz`
- BWD dataset archive: `wget https://dl.fbaipublicfiles.com/SymbolicMathematics/data/prim_bwd.tar.gz`
- IBP dataset archive: `wget https://dl.fbaipublicfiles.com/SymbolicMathematics/data/prim_ibp.tar.gz`
- The following snippet extracts one of the archives above, which produces the train, validation and test splits:
  ```python
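  import tarfile

  fname = 'prim_fwd.tar.gz'  # repeat for prim_bwd.tar.gz and prim_ibp.tar.gz
  # The context manager closes the archive even if extraction fails
  with tarfile.open(fname, "r:gz") as tar:
      tar.extractall()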
  ```
3. Download the weights of the [Lample & Charton 2019](https://arxiv.org/abs/1912.01412) model trained on the FWD dataset:
- `wget https://dl.fbaipublicfiles.com/SymbolicMathematics/models/fwd.pth`
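
The extraction in step 2 can be repeated for all three archives in one pass. The sketch below assumes the archives were downloaded into a single directory under the file names used in the `wget` commands; `extract_archives` is a hypothetical helper, not part of this repository. Archives that have not been downloaded yet are skipped.

```python
import tarfile
from pathlib import Path

# File names taken from the wget commands in step 2
ARCHIVES = ("prim_fwd.tar.gz", "prim_bwd.tar.gz", "prim_ibp.tar.gz")


def extract_archives(archive_dir=".", names=ARCHIVES):
    """Extract each listed archive found in archive_dir.

    Hypothetical helper for illustration. Returns the names of the
    archives that were actually extracted.
    """
    extracted = []
    for name in names:
        path = Path(archive_dir) / name
        if not path.is_file():
            continue  # archive not downloaded yet, skip it
        with tarfile.open(path, "r:gz") as tar:
            tar.extractall(path=archive_dir)
        extracted.append(name)
    return extracted


# Example usage:
# print(extract_archives("."))
```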

### Dataset Generation & Creation
Building the dataset involves two stages: extracting integration steps from the functions in the FWD dataset (dataset generation), and turning those extracted steps into individual samples (dataset creation).
1. `dataset_generation.py` performs dataset generation. It continuously stores the steps extracted for each function in a CSV file.
2. `dataset_creation.py` reads the CSV generated above and creates the final dataset, also in CSV format, where each row is a function-integration rule pair.
3. Both scripts can be run with a plain `python` command.
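
The dataset creation stage can be pictured as flattening per-function step lists into one sample per step. The sketch below is only an illustration of that idea: the in-memory structure, the function name `flatten_steps`, and the rule labels are all assumptions, since the actual layout of the intermediate CSV is defined by `dataset_generation.py` and is not described here.

```python
def flatten_steps(steps_by_function):
    """Turn {function: [(expression, rule), ...]} into a flat list of pairs.

    Hypothetical structure for illustration only; the real scripts
    exchange this information through CSV files.
    """
    pairs = []
    for _function, steps in steps_by_function.items():
        for expression, rule in steps:
            # Each extracted step becomes its own training sample
            pairs.append((expression, rule))
    return pairs


# Illustrative input: one function whose integration took two steps.
# The rule labels here are placeholders, not the dataset's actual vocabulary.
example = {
    "2*x*cos(x)": [
        ("2*x*cos(x)", "ConstantTimesRule"),
        ("x*cos(x)", "PartsRule"),
    ],
}
samples = flatten_steps(example)
```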

### Model Training
The training script uses PyTorch. It trains the model on multiple GPUs in mixed precision mode and saves a checkpoint after each epoch if the validation loss has decreased relative to the previous epoch. We trained our model on 4 NVIDIA Tesla T4 GPUs.
1. Run `python symbolic_math_multi_gpu.py false` to train the model from scratch.
2. Before training, `symbolic_math_multi_gpu.py` also filters the dataset. It removes samples with input sequence length greater than 384 and output sequence length greater than 30.
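
The length filter and the checkpoint rule described above can be sketched in plain Python. This is not the code from `symbolic_math_multi_gpu.py`; the function names are hypothetical, and two assumptions are made: a sample is dropped as soon as either length limit is exceeded, and the first epoch always produces a checkpoint because there is no earlier epoch to compare against.

```python
MAX_INPUT_LEN = 384   # limit stated above for input sequences
MAX_OUTPUT_LEN = 30   # limit stated above for output sequences


def keep_sample(input_tokens, output_tokens):
    """Assumption: keep a sample only if both sequences are within limits."""
    return (len(input_tokens) <= MAX_INPUT_LEN
            and len(output_tokens) <= MAX_OUTPUT_LEN)


def epochs_to_checkpoint(val_losses):
    """Return the epoch indices at which a checkpoint would be saved.

    A checkpoint is saved when the validation loss is lower than in the
    previous epoch. Assumption: epoch 0 is always saved.
    """
    saved = []
    previous = None
    for epoch, loss in enumerate(val_losses):
        if previous is None or loss < previous:
            saved.append(epoch)
        previous = loss
    return saved
```

For example, with validation losses `[0.9, 0.7, 0.8, 0.75]` this rule saves epochs 0, 1 and 3: epoch 3 improves on epoch 2 even though it is worse than epoch 1, because the comparison is against the previous epoch rather than the best one so far.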

### Evaluation and Results
1. `manualintegrate_model.py` contains the code referred to as `guided_integral_steps` in the main text.
2. `manualintegrate_orig.py` contains the code referred to as `integral_steps` in the main text.
3. The model used in `guided_integral_steps` can be downloaded from [here](https://drive.google.com/file/d/1SsuzOgcSRu25C8j38pwY-Y2tth-ipWyo/view?usp=sharing).
4. All results can be computed using the following instructions:
- For `guided_integral_steps`:
  - Use `eval_integral_steps_model.py` to run all the experiments.
    - Edit `Line 205` to run the experiment on the desired dataset.
    - To run the node limit experiments:
      - Uncomment `Line 90 - Line 91` before running `eval_integral_steps_model.py`, and set the node threshold in the condition on `Line 90`.
- For `integral_steps`:
  - Use `eval_integral_steps_orig.py` to run all the experiments.
    - Edit `Line 205` to run the experiment on the desired dataset.
    - To run the node limit experiments:
      - Uncomment `Line 90 - Line 91` before running `eval_integral_steps_orig.py`, and set the node threshold in the condition on `Line 90`.

## References
Parts of this codebase are inspired by the following:
- https://github.com/cloneofsimo/poly2SOP
- https://github.com/facebookresearch/SymbolicMathematics
- https://github.com/sympy/sympy
