# Creating Synthetic Datasets via Evolution for Neural Program Synthesis
Implementation of the experiments on PCCoder (https://github.com/amitz25/PCCoder) from the paper "Creating Synthetic Datasets via Evolution for Neural Program Synthesis".

## Running the evolutionary algorithm
scripts/train_adv.py runs the adversarial algorithm from our paper, which finds the parameters of the distribution on which a given model performs worst.
```
python3.6 -m scripts.train_adv model
```

## Generating a dataset
scripts/gen_programs.py generates a dataset. For example:
```
python3.6 -m scripts.gen_programs --num_train=100000 --num_test=500 --train_output_path=train_dataset --test_output_path=test_dataset --max_train_len=12 --test_lengths="5" --num_workers=8
```
To generate datasets with specific distributions of integers and list lengths as in the paper, change the parameters `integer_min`, `integer_max`, `min_list_len`, and `max_list_len` in params.py before running scripts/gen_programs.py.
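
As a rough illustration, the relevant part of params.py might look like the following after editing. The four parameter names come from the text above; the values and the comments describing each parameter are placeholders assumed for this sketch, not the settings used in the paper.

```python
# Hypothetical excerpt of params.py. The values below are placeholders
# for illustration only, not the settings used in the paper.
integer_min = -100   # assumed: smallest integer allowed in generated samples
integer_max = 100    # assumed: largest integer allowed in generated samples
min_list_len = 1     # assumed: shortest generated input list
max_list_len = 10    # assumed: longest generated input list

# Sanity checks worth running before starting a long generation job.
assert integer_min <= integer_max
assert 0 <= min_list_len <= max_list_len
```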

## Training the network
scripts/train.py expects just the input dataset and the output path of the model. A model is saved for each epoch.
```
python3.6 -m scripts.train dataset model
```

CUDA is detected automatically, and the GPU is used if it is available.

## Solving a test dataset
scripts/solve_problems.py expects a list of I/O sample sets and a network, and solves the problems in multiple processes concurrently.
For now, CUDA is not used when solving problems. We advise against using it at test time: its effect on speed is minimal there, and it may cause synchronization problems in PyTorch.
```
python3.6 -m scripts.solve_problems dataset result model 60 5 --num_workers=8
```

1. max_program_len dictates the maximum depth of the search.
2. The result file contains one JSON dictionary per line, one for each predicted program. The dictionary contains the predicted program and some details about the search, such as how long the search took and the final beam size.
3. Use --search_method to change the method from the default CAB search to DFS.
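
Since the result file holds one JSON dictionary per line, it can be loaded with the standard library alone. The helper below is a minimal sketch: the name `load_results` is made up for this example, and it deliberately makes no assumption about which keys each dictionary contains.

```python
import json


def load_results(path):
    """Read a result file that stores one JSON dictionary per line."""
    results = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                results.append(json.loads(line))
    return results
```

For instance, `load_results("result")` would return a list of dictionaries for the `result` file written by the command above; printing the first entry is an easy way to see which fields are available.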

## Changing parameters
params.py contains all of the global "constants". This includes the program's memory size (calculated as params.num_inputs + params.max_program_len, both of which can be changed), the number of examples, the DSL integer range and maximum array size, and more.

## Program representation
As in https://github.com/dkamm/deepcoder, each program is represented in a compact string:
1. '|' delimits each statement
2. ',' delimits a function call and its arguments

Specifically, this is the general format:
```
INPUT_0_TYPE|...|INPUT_K_TYPE|FUNCTION_CALL_0,PARAMS_0|...|FUNCTION_CALL_N,PARAMS_N
```

For example, the program:
```
a <- [int]
b <- FILTER (%2==0) a
c <- MAP (/2) b
d <- SORT c
e <- LAST d
```

will be represented as:
```
LIST|FILTER,EVEN,0|MAP,/2,1|SORT,2|LAST,3
```
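
To make the format concrete, here is a minimal sketch of splitting such a string back into its parts. The function `parse_program` and the set `INPUT_TYPES` are hypothetical names, not part of this repository; the sketch also assumes that the input types are spelled `LIST` and `INT` (only `LIST` appears in the example above) and that they always precede the function calls.

```python
# Assumed spellings of the input type tokens; only LIST appears in the example.
INPUT_TYPES = {"INT", "LIST"}


def parse_program(encoded):
    """Split a compact program string into input types and statements."""
    inputs = []
    statements = []
    for token in encoded.split("|"):
        # Leading tokens that name a type are inputs; the rest are calls.
        if not statements and token in INPUT_TYPES:
            inputs.append(token)
        else:
            function, *args = token.split(",")
            statements.append((function, args))
    return inputs, statements


inputs, statements = parse_program("LIST|FILTER,EVEN,0|MAP,/2,1|SORT,2|LAST,3")
print(inputs)      # ['LIST']
print(statements)  # [('FILTER', ['EVEN', '0']), ('MAP', ['/2', '1']), ('SORT', ['2']), ('LAST', ['3'])]
```

In the example, the numeric arguments appear to index earlier values in order of definition (0 for `a`, 1 for `b`, and so on), which is why the sketch keeps them as strings rather than interpreting them.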

## Special thanks
We would like to thank Amit Zohar and Lior Wolf for publishing their code for PCCoder (https://github.com/amitz25/PCCoder), which we rely on for our experiments.
