# Sample Data Files for PDA Paper
This directory contains sample data files extracted from our benchmark dataset, **Form**alization for **L**ean **4** (**<span style="font-variant: small-caps;">forml4</span>**), introduced in our paper *Process-Driven Autoformalization in Lean 4*.
Each file is a **subset of 100 randomly sampled** entries from the corresponding full dataset. These samples showcase our research on formalizing and informalizing both mathematical questions and proofs.

## Files in this Directory

1. `Real_test_sample.json` 
2. `FormL4_basic_test_sample.json`
3. `FormL4_random_test_sample.json`
4. `FormL4_train_sample.json`

Each file contains a list of dictionary items, where each dictionary represents a mathematical problem and its associated proofs in both formal and natural language formats.

## Data Structure

Each dictionary in the sample files generally contains four key-value pairs:

1. `fl_statement_proof`: The ground truth formal statement and proof in Lean 4.
2. `nl_problem`: The informalized statement of the problem in natural language.
3. `nl_explanation`: A natural language explanation of each step in the formal proof, based on the definitions of employed lemmas or tactics.
4. `nl_proof`: A step-by-step proof of the problem in natural language, without mentioning any Lean 4 functions verbatim.
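As a minimal sketch of how the structure above could be checked, the following Python snippet loads one sample file and reports which of the four documented keys each entry is missing. The helper names (`load_entries`, `missing_keys`, `count_missing`) are hypothetical and not part of the dataset tooling; only the field names come from the list above.

```python
import json
from collections import Counter
from pathlib import Path

# Field names as documented in this README.
EXPECTED_KEYS = ("fl_statement_proof", "nl_problem", "nl_explanation", "nl_proof")


def load_entries(path):
    """Load one sample file, which is expected to hold a JSON list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        entries = json.load(f)
    if not isinstance(entries, list):
        raise ValueError(
            f"expected a JSON list in {path}, got {type(entries).__name__}"
        )
    return entries


def missing_keys(entry, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from a single entry."""
    return [key for key in expected if key not in entry]


def count_missing(entries):
    """Count, over all entries, how often each expected key is absent."""
    counts = Counter()
    for entry in entries:
        counts.update(missing_keys(entry))
    return counts
```

For example, `count_missing(load_entries("FormL4_basic_test_sample.json"))` would return an empty counter if every entry in that file carries all four fields.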

### Exception: Real Test Sample

The `Real_test_sample.json` file differs from the others:

- It lacks the `fl_statement_proof` and `nl_explanation` fields.
- This is because its entries originate from natural language sources and do not go through the LLM-based informalization process.
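Code that reads all four sample files therefore needs to tolerate the two absent fields. The sketch below shows one possible way to do that, assuming the real test entries still carry `nl_problem` and `nl_proof`. The function names are hypothetical, and filling absent fields with `None` is an illustrative choice rather than a convention of the dataset.

```python
# Fields documented for the FormL4 sample files.
ALL_KEYS = ("fl_statement_proof", "nl_problem", "nl_explanation", "nl_proof")

# Fields that the real test sample is documented to lack.
FORMAL_SIDE_KEYS = ("fl_statement_proof", "nl_explanation")


def is_real_style(entry):
    """True when an entry carries none of the formal-side fields."""
    return not any(key in entry for key in FORMAL_SIDE_KEYS)


def normalize(entry):
    """Return a copy with every documented field present; absent ones become None."""
    return {key: entry.get(key) for key in ALL_KEYS}
```

Normalizing entries this way lets downstream code index every field uniformly and test for `None` instead of catching `KeyError`.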

## Important Notes on Data Curation

### Natural Language Proofs (nl_proof)
The `nl_proof` field is our attempt to write proofs in natural language without directly referencing Lean 4 constructs. These natural language proofs may not align perfectly with their formal counterparts. As emphasized in our paper:

> In empirical practice, we observe that it is usually *infeasible* to perfectly translate a set of formal proofs to natural language. This is because formal proofs are often expressed in pre-defined lemmas or environments that are exclusively constructed in the Lean 4 language, and there are no existing terms in natural language corresponding to them that a non-expert in Lean 4 could easily understand.

### Challenges in Informalization
Informalizing theorem proofs without explaining every lemma verbatim is an extremely challenging task, even for human experts in Lean 4. This challenge arises from the specialized nature of formal proof environments and the lack of direct natural language equivalents for many formal constructs.
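A small illustration of this gap, written for this README and not taken from the dataset: the Lean 4 proof below consists of a single tactic call, and the theorem name is hypothetical.

```lean
-- Illustrative only; not an entry from the dataset.
theorem add_zero_example (n : Nat) : n + 0 = n := by
  simp
```

A natural language proof would say something like "adding zero to a number leaves it unchanged", but that sentence restates the claim and does not describe what `simp` does, namely rewriting the goal with a library of simplification lemmas. Explaining that step faithfully requires either naming Lean 4 machinery or substituting a different informal argument.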

## Usage

These sample files are provided to give researchers and interested parties insight into the structure and content of our dataset. They can be used to:

- Get an overview of the autoformalization process from natural language to formal language
- Understand the relationship between formal and informal proofs
- Analyze the challenges in proof translation between formal and natural languages
- Gain insights into the structure of mathematical proofs in both Lean 4 and natural language
- Evaluate autoformalization techniques on real-world mathematical problems (using the real test sample)

## Further Information

For more detailed information about our research methodology, findings, and the full dataset, please refer to our PDA paper.