# Empirically Testing Expressivity Bounds for Neural Network Architectures

## Directory Structure

* `experiments/`: Contains high-level scripts for reproducing all of the
  experiments and figures in the paper. These scripts are made to run on a
  computing cluster.
* `scripts/`: Contains helper scripts for setting up the software environment,
  building container images, running containers, installing Python packages,
  etc. Instructions for using these scripts are below.
* `src/recognizers/`: Contains source code for training neural networks,
  generating data, etc.
  * `analysis/`: Code for generating plots, analyzing predictions, etc.
  * `automata/`: Data structures and algorithms for automata.
  * `benchmarking/`: Performance benchmarking for certain algorithms.
  * `dataset_generation/`: Code for generating training, validation, and test sets.
  * `grammars/`: Data structures and algorithms for grammars.
  * `hand_picked_languages/`: Implementations of certain handpicked languages.
  * `language_sampling/`: Code for randomly sampling automata.
  * `neural_networks/`: Code for training and evaluating neural networks.
  * `string_sampling/`: Code for sampling positive and negative strings.
  * `tools/`: Certain useful tools.
* `tests/`: Contains pytest unit tests for the code under `src/`.


## Local Installation and Setup

In order to foster reproducibility, the code for this paper was developed and
run inside of a [Docker](https://www.docker.com/) container defined in the file
[`Dockerfile-dev`](Dockerfile-dev). To run this code, you can build the
Docker image yourself and run it using Docker. Or, if you don't feel like
installing Docker, you can simply use `Dockerfile-dev` as a reference for
setting up the software environment on your own system. You can also build
an equivalent [Singularity](https://sylabs.io/docs/#singularity) image which
can be used on an HPC cluster, where it is likely that Docker is not available
but Singularity is. There is a script that automatically sets up the Docker
container and opens a shell in it (instructions below).

### Using Docker

In order to use the Docker image, you must first
[install Docker](https://www.docker.com/get-started).
If you intend to run any experiments on a GPU, you must also ensure that your
NVIDIA driver is set up properly and install the
[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
A GPU is not required, however; our experiments are quite fast even on CPU.

In order to automatically build the Docker image, start the container, and open
up a bash shell inside of it, run

    $ bash scripts/docker_shell.bash --build

After you have built the image once, there is no need to do so again, so
afterwards you can simply run

    $ bash scripts/docker_shell.bash

By default, this script starts the container in GPU mode, which will fail if
you are not running on a machine with a GPU. If you only want to run things in
CPU mode, you can run

    $ bash scripts/docker_shell.bash --cpu

### Using Singularity

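Singularity is an alternative container runtime that is more suitable for
shared computing environments. Note: you may also encounter Singularity under
the name Apptainer, which the open-source project adopted when it joined the
Linux Foundation; for the purposes of these instructions, you can treat them
as the same thing.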

In order to run the code in a Singularity container, you must first obtain the
Docker image and then convert it to a `.sif` (Singularity image) file on a
machine where you have root access (for example, your personal computer or
workstation). This requires installing both Docker and
[Singularity](https://docs.sylabs.io/guides/latest/user-guide/quick_start.html)
on that machine. Assuming you have already built the Docker image according to
the instructions above, you can use the following to create the `.sif` file:

    $ bash scripts/build_singularity_image.bash

This will create the file `neural-network-recognizers.sif`. It is normal for
this to take several minutes. Afterwards, you can upload the `.sif` file to
your HPC cluster and use it there.

You can open a shell in the Singularity container using

    $ bash scripts/singularity_shell.bash

This works on machines both with and without an NVIDIA GPU, although it
will output a warning if there is no GPU.

### Additional Setup

Whatever method you use to run the code (whether in a Docker container,
Singularity container, or no container), you must run this script once (*inside
the container shell*):

    $ bash scripts/setup.bash

Specifically, this script installs the Python packages required by our code,
which will be stored in the local directory rather than system-wide.


## Adding New Python Packages

We use the package manager [Poetry](https://python-poetry.org/) to manage
Python packages. It's like pip, but you don't need to manually update a
`requirements.txt` file. Poetry tracks all of the Python packages required by
the code in the files `pyproject.toml` and `poetry.lock`. In order to add a
new Python package, run `poetry add <package-name>`, and commit the updated
`pyproject.toml` and `poetry.lock` files to git.


## Running Code

All files under `src/` should be run using `poetry` so they have access to the
Python packages provided by the Poetry package manager. This means you should
either prefix all of your commands with `bash scripts/poetry_run.bash` or run
`bash scripts/poetry_shell.bash` beforehand to enter a shell with Poetry's
virtualenv enabled all the time. You should run both Python and Bash scripts
with Poetry, because the Bash scripts might call out to Python scripts. All
Bash scripts under `src/` should be run with `src/` as the current working
directory.

All scripts under `scripts/` and `experiments/` should be run with the
top-level directory as the current working directory.

## Running Experiments

The [`experiments/`](experiments) directory contains scripts for reproducing
all of the experiments and plots presented in the paper. Some of these scripts
are intended to be used to submit jobs to a computing cluster. They should be
run outside of the container. You will need to edit the file
[`experiments/submit_job.bash`](experiments/submit_job.bash)
to tailor it to your specific computing cluster.

Other scripts are for plotting or printing tables and should be run inside the
container.

### Dataset Generation

Scripts for generating all of our datasets from scratch can be found under
`experiments/dataset_generation/`. All datasets are sampled using a fixed
random seed, so the results are deterministic.

Note that the sampled strings are first written to plaintext files, which
still need to be "prepared" (converted to integers in `.pt` files) before they
can be used to train neural networks using our code.

* `submit_generate_data_jobs.bash`: Generate and prepare all datasets for 
  a certain language class. The language class and number of languages must
  be specified in the file.


Dataset generation consists of the following steps:

1. Write the DFA or CFG for the language to a .pt file.
2. Run weight pushing on the DFA or CFG so it can be used
   for sampling.
3. Randomly sample positive and negative examples for each split, and save the
   results as plaintext files.
4. Prepare the plaintext files by converting all symbols to integers and saving
   them in .pt files.
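
As a rough illustration of steps 3 and 4, here is a minimal, self-contained
sketch on a toy DFA. It is not the code under `src/recognizers/`, and every
name in it (`DELTA`, `count_suffixes`, `generate`, `prepare`, etc.) is
hypothetical. Instead of weight pushing, it uses a simpler stand-in: it counts
the accepted completions from each state and samples symbols in proportion to
those counts. Negative examples are drawn by rejection sampling, which is an
assumption rather than the method used in the paper. Writing the `.pt` files
is omitted.

```python
import random

# Toy DFA over {"0", "1"} that accepts strings with an even number of "1"s.
# This is a hypothetical example, not one of the languages in the repository.
ALPHABET = ["0", "1"]
START = 0
ACCEPT = {0}
DELTA = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}


def accepts(symbols):
    """Run the DFA on a sequence of symbols."""
    state = START
    for symbol in symbols:
        state = DELTA[(state, symbol)]
    return state in ACCEPT


def count_suffixes(max_length):
    """counts[n][q] = number of accepted strings of length n read from q."""
    states = {q for (q, _) in DELTA}
    counts = [{q: int(q in ACCEPT) for q in states}]
    for _ in range(max_length):
        prev = counts[-1]
        counts.append(
            {q: sum(prev[DELTA[(q, a)]] for a in ALPHABET) for q in states}
        )
    return counts


def sample_positive(length, counts, rng):
    """Sample uniformly among the accepted strings of the given length."""
    state, symbols = START, []
    for remaining in range(length, 0, -1):
        # Weight each symbol by the number of accepting completions after it.
        weights = [counts[remaining - 1][DELTA[(state, a)]] for a in ALPHABET]
        symbol = rng.choices(ALPHABET, weights=weights)[0]
        symbols.append(symbol)
        state = DELTA[(state, symbol)]
    return symbols


def sample_negative(length, rng):
    """Rejection sampling: draw random strings until one is rejected."""
    while True:
        symbols = [rng.choice(ALPHABET) for _ in range(length)]
        if not accepts(symbols):
            return symbols


def generate(num_examples, max_length, seed):
    """Step 3: sample labeled strings as plaintext, using a fixed seed."""
    rng = random.Random(seed)
    counts = count_suffixes(max_length)
    lines = []
    for _ in range(num_examples):
        # Lengths start at 1 because this toy language has no negative
        # string of length 0.
        length = rng.randint(1, max_length)
        if rng.random() < 0.5:
            lines.append((" ".join(sample_positive(length, counts, rng)), 1))
        else:
            lines.append((" ".join(sample_negative(length, rng)), 0))
    return lines


def prepare(lines):
    """Step 4: convert every symbol to an integer ID."""
    vocab = {symbol: index for index, symbol in enumerate(ALPHABET)}
    return [([vocab[s] for s in text.split()], label) for text, label in lines]
```

Because the generator is seeded, calling `generate(50, 8, seed=123)` twice
returns identical data, which mirrors the determinism described above.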

### Training Neural Networks

The relevant scripts are under `experiments/training/`. They should be run
after datasets are generated and prepared.

* `submit_train_and_evaluate_jobs.bash`: Train and evaluate all models on all
  languages from a language class. The language class and number of languages must
  be specified in the file.


### Analysis

The relevant scripts are under `experiments/analysis/`. They should be run
after models are trained and evaluated.

* `submit_print_summary_table.bash`: Generate a summary table of the average accuracy
  of the models and percentage of languages learned perfectly for all classes.
* `submit_plot_summary_figure.bash`: Generate a figure summarizing the performance of
  all architectures on each language from each language class.
* `submit_plot_accuracy_vs_size.bash`: Plot accuracy as a function of machine
  size. You must specify the language class, number of languages, and the size
  measure (e.g., number of states, alphabet size, etc.).
* `submit_plot_cross_entropy_vs_length.bash`: Plot average cross entropy as a function of
  input length. 
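
To make the two quantities in the summary table concrete, here is a minimal
sketch of how they could be computed from per-language test accuracies. It is
not the repository's analysis code: the function name `summarize`, the input
layout, and the accuracy values are all hypothetical, and treating "learned
perfectly" as an accuracy of 1.0 is an assumption.

```python
from statistics import mean


def summarize(accuracies, threshold=1.0):
    """Summarize per-language accuracies for each architecture.

    `accuracies` maps an architecture name to a list of test accuracies,
    one per language. A language counts as "learned perfectly" when its
    accuracy reaches `threshold` (assumed here to be 1.0).
    """
    summary = {}
    for architecture, values in accuracies.items():
        num_perfect = sum(value >= threshold for value in values)
        summary[architecture] = {
            "mean_accuracy": mean(values),
            "percent_perfect": 100.0 * num_perfect / len(values),
        }
    return summary


# Made-up numbers, for illustration only.
results = {
    "model_a": [1.0, 0.5, 1.0, 0.75],
    "model_b": [1.0, 1.0, 1.0, 1.0],
}

for architecture, stats in summarize(results).items():
    print(
        f"{architecture}\t{stats['mean_accuracy']:.3f}"
        f"\t{stats['percent_perfect']:.1f}%"
    )
```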
