# SIP - Simulation-Induced Prior

This is the code for the paper **Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation**.

## Setup
Create a conda environment with Python 3.10 and PyTorch 2.0. Then install the remaining requirements:

```
pip install -r requirements.txt
```

## Data generation
### Pre-training data
To generate the pre-training data for SIP-d4, run the following from the directory that contains this README:
```
python -m sip.data_gen.gen_pretrain
```
If you want the data for SIP-d4+, uncomment the two lines in that script that mention SIP-d4+ before running it.
To generate the pre-training data for SIP-nd7, run:
```
python -m sip.data_gen.gen_bimachines
```

### Evaluation data
To generate the train/test data for the experiments with deterministic FSTs,
open `data_gen/gen_in_pretraining_dist.py`, set `num_states` to the desired value, and run the script.

For the experiments with non-deterministic FSTs, run `data_gen/gen_bimachine_eval.py`.

## Pre-training
To pre-train SIP-d4, run
```
python config_evaluator.py configs/pretrain_SIP_d4.jsonnet
```
Once SIP-d4 has been pre-trained, SIP-nd7 and SIP-d4+ can be pre-trained as well.
The configuration files were designed for RTX 2080 Ti GPUs, so they assume ~12 GB of GPU RAM.
Pre-training the task embedding baseline assumes two GPUs
(one for the main model, one for the task embeddings), but this is configurable.

## Experimental Results
You can find the experimental results in the folder `results/`. 
For each set of experiments, it contains a CSV file with one row per run.

You can re-compute the aggregated results we report and analyze the results further. 
For example, to compute a results table for grapheme-to-phoneme conversion, run:
```
import pandas as pd

table = pd.read_csv("results/g2p_export.csv")
aggr = table.loc[:, ['lang', 'model', 'Acc', 'PER']].groupby(["lang", "model"]).mean()
print(aggr)
```
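Since each row is a single run, it can also be useful to report the spread and the number of runs behind each mean. The sketch below is one way to do this. `aggregate_runs` is a hypothetical helper that is not part of this repository, and it assumes only the column names used in the example above.

```python
import pandas as pd


def aggregate_runs(table, group_cols, metric_cols):
    """Return mean, standard deviation and run count per group.

    Hypothetical helper (not part of the repository).
    """
    group_cols = list(group_cols)
    metric_cols = list(metric_cols)
    grouped = table.loc[:, group_cols + metric_cols].groupby(group_cols)
    # Passing a list of functions yields two-level columns: (metric, statistic).
    # Groups with a single run get NaN as their standard deviation.
    return grouped.agg(["mean", "std", "count"])
```

For the grapheme-to-phoneme table, this could be called as `aggregate_runs(table, ["lang", "model"], ["Acc", "PER"])`.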
Each run is also associated with a JSON configuration (column `config_text`)
and the template it was instantiated from (column `jsonnet`).

## Fine-tuning
To fine-tune, you first need a configuration file
that specifies the hyperparameters and the paths to the training and evaluation data.
Say we want to reproduce SIP-d4 on the first iteration generalization task with a 4-state (deterministic) FST.
We can extract the configuration from the CSV file:
```
import pandas as pd

table = pd.read_csv("results/synthetic_experiments_export.csv")
run = table.query("model == 'SIP-d4' & task == 0 & num_states == 4 & gen_type == 'Iteration' ")
with open("myconfig.json","w") as f:
    f.write(run['config_text'].iloc[0])
```
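Note that `.iloc[0]` raises an `IndexError` if the query matches no run. A slightly more defensive variant, sketched below, checks for this and also parses the text as JSON before writing it. `write_config` is a hypothetical helper that is not part of this repository, and it assumes that `config_text` holds plain JSON.

```python
import json


def write_config(config_texts, path):
    """Write the first matching configuration to path and return it as a dict.

    Hypothetical helper (not part of the repository). config_texts is any
    iterable of JSON strings, e.g. the `config_text` column of the query result.
    """
    texts = list(config_texts)
    if not texts:
        raise ValueError("no run matched the query; check the column values")
    # json.loads raises json.JSONDecodeError if the text is not valid JSON.
    config = json.loads(texts[0])
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
        f.write("\n")
    return config
```

With the query above, this could be called as `write_config(run['config_text'], "myconfig.json")`.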
The configuration file can then be run as follows:
```
python config_evaluator.py myconfig.json
```
The run will try to log results to Neptune.ai, so you need to supply your own project in the configuration file:
```
    "logger": {
        "f": "NeptuneLogger.create",
        "project": "<YOUR NAME>/<YOUR PROJECT>"
    },
```
Alternatively, you can log metrics to stdout/stderr with the following setting (see `logger.py` for details):
```
    "logger": {
        "f": "TqdmLogger.create",
        "print": true
    },
```
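If you extract many configurations, editing the logger entry by hand gets tedious. The sketch below swaps it programmatically. It assumes that `logger` is a top-level key of the JSON configuration, which may not hold for every config. `use_tqdm_logger` and `rewrite_logger` are hypothetical helpers that are not part of this repository.

```python
import json


def use_tqdm_logger(config):
    """Return a copy of config whose logger entry is the TqdmLogger block.

    Assumption: "logger" is a top-level key of the configuration.
    """
    updated = dict(config)  # shallow copy; the input dict is left unchanged
    updated["logger"] = {"f": "TqdmLogger.create", "print": True}
    return updated


def rewrite_logger(path):
    """Replace the logger entry of the JSON configuration stored at path."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(use_tqdm_logger(config), f, indent=4)
        f.write("\n")
```

For the configuration extracted above, this could be called as `rewrite_logger("myconfig.json")` before running `config_evaluator.py`.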
