# Proof-Pile-2

### Source Code

To build the source code portion of the Proof-Pile-2, run

```
python process_stack.py -c $NUM_CPUS
python process_github.py
```

To avoid exceeding the GitHub rate limit, set the `GIHUB_ACCESS_TOKEN` environment variable to your GitHub personal access token.
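
Before starting a long run, it can help to confirm that the token variable is actually visible to Python. The snippet below is a minimal sketch, not part of this repository: `token_is_set` is a hypothetical helper, and the variable name is copied exactly as spelled above. If the script does not seem to pick up the token, check the name against `process_github.py`.

```python
import os

# Variable name copied verbatim from this README (assumption: the script
# reads the same name; verify against process_github.py if in doubt).
TOKEN_VAR = "GIHUB_ACCESS_TOKEN"


def token_is_set(env=os.environ):
    """Return True if the token variable is present and non-empty in `env`."""
    return bool(env.get(TOKEN_VAR, "").strip())


if __name__ == "__main__":
    if token_is_set():
        print(f"{TOKEN_VAR} is set.")
    else:
        print(f"{TOKEN_VAR} is not set; requests may be rate limited.")
```

The helper takes the environment as a parameter so it can be checked against a plain dictionary without touching the real environment.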

### Isabelle Proofsteps
Isabelle Proofsteps are processed in `./process_isabelle_proofsteps.py`.

To process the Isabelle proofsteps, first extract or download the PISA dataset using the code available at https://github.com/albertqjiang/Portal-to-ISAbelle/. Then run `./process_isabelle_proofsteps.py` with the correct paths to the AFP extractions, the standard library extractions, and PISA's test set (`universal_test_theorems`).

### Lean Proofsteps
First, [install Lean 4](https://lean-lang.org/lean4/doc/setup.html). Then, from this directory run
```
git clone https://github.com/semorrison/lean-training-data.git
```

Follow the further installation instructions in `lean-training-data`. 

Finally, run
```
python process_lean_proofsteps.py --vocab $PATH_TO_LLAMA_TOKENIZER_MODEL
```
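
Since this step depends on both the cloned `lean-training-data` directory and a tokenizer model file, a quick pre-flight check can catch a wrong path before the script starts. The following is a rough sketch under stated assumptions: `missing_inputs` is a hypothetical helper that is not part of this repository, and it only checks that the paths exist, not that their contents are valid.

```python
from pathlib import Path


def missing_inputs(repo_dir, tokenizer_model):
    """Return a list of problems found; the list is empty if both inputs exist."""
    problems = []
    # The clone step above is expected to create this directory.
    if not Path(repo_dir).is_dir():
        problems.append(f"directory not found: {repo_dir}")
    # The value passed to --vocab should point at an existing file.
    if not Path(tokenizer_model).is_file():
        problems.append(f"tokenizer model not found: {tokenizer_model}")
    return problems
```

For example, calling `missing_inputs("lean-training-data", path_to_tokenizer)` from this directory returns an empty list when both the clone and the tokenizer model are in place.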
