# MatPROV: A Provenance Graph Dataset of Material Synthesis Extracted from Scientific Literature

## Environment

To get started, clone this repository, then run the following from the repository root to install the dependencies with `uv` and activate the virtual environment:

```bash
uv sync
source ./.venv/bin/activate
```

## Relevant Text Extraction

**1. Convert PDFs to XML and extract body text**

Use [GROBID](https://github.com/kermitt2/grobid) to convert PDF files into XML, then extract the body section as plain text.
Refer to the official [GROBID documentation](https://grobid.readthedocs.io/en/latest/Grobid-service/) for instructions on starting the GROBID server.
Place the target PDF files in the directory passed as `--input-dir`, naming each file `<Paper's DOI>.pdf` with any forward slashes (`/`) in the DOI replaced by underscores (`_`), then run the following command:

```bash
python src/processing/pdf_to_text.py \
  --input-dir "data/inputs" \
  --output-dir "data/inputs" \
  --grobid-url "http://<GROBID_server_URL>:8070"
```

**Arguments:**

| Argument     | Required | Default                 | Description                         |
|--------------|----------|-------------------------|-------------------------------------|
| --input-dir  | Yes      |                         | Directory containing PDF files      |
| --output-dir | Yes      |                         | Directory to save XML and TXT files |
| --grobid-url | No       | `http://localhost:8070` | GROBID server URL                   |
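
The file naming rule for the input PDFs can be sketched in Python as follows. This is a minimal illustration and not part of the repository: `doi_to_filename` is a hypothetical helper name, and the example DOI is inferred from the few-shot file name that appears later in this README.

```python
def doi_to_filename(doi: str, suffix: str = ".pdf") -> str:
    """Return the file name expected for a DOI.

    Forward slashes are replaced with underscores, as described above.
    `suffix` is an illustrative parameter, not an option of the scripts.
    """
    return doi.replace("/", "_") + suffix


print(doi_to_filename("10.1002/advs.201901598"))  # 10.1002_advs.201901598.pdf
```

Note that the mapping is one-way: a DOI that already contains underscores cannot be recovered unambiguously from the file name.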

**2. Extract relevant text from body text using an LLM**

For each plain-text file named `<Paper's DOI>.txt`, an LLM extracts the synthesis-related text and saves it as `<Paper's DOI>_llm.txt`.

```bash
export OPENAI_API_KEY="your-api-key"

python src/inference/relevant_text_extraction.py \
  --input-dir "data/inputs" \
  --output-dir "data/inputs" \
  --model "gpt-4o-mini" \
  --template-path "src/prompts/relevant_text_extraction.txt"
```

**Arguments:**

| Argument        | Required | Default       | Description                           |
|-----------------|----------|---------------|---------------------------------------|
| --input-dir     | Yes      |               | Directory containing TXT files        |
| --output-dir    | Yes      |               | Directory to save extracted TXT files |
| --model         | No       | `gpt-4o-mini` | OpenAI model name                     |
| --template-path | Yes      |               | Prompt template file path             |
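
Because the command above reads and writes the same directory, it can help to see which body-text files still lack an `_llm.txt` counterpart. The sketch below only relies on the naming convention described in this step; `llm_output_path` and `pending_inputs` are hypothetical helpers, not functions from this repository.

```python
from pathlib import Path


def llm_output_path(txt_path: Path, output_dir: Path) -> Path:
    """Map '<DOI>.txt' to '<DOI>_llm.txt' inside output_dir."""
    return output_dir / f"{txt_path.stem}_llm.txt"


def pending_inputs(input_dir: Path, output_dir: Path) -> list[Path]:
    """List body-text files that have no '_llm.txt' counterpart yet."""
    pending = []
    for txt_path in sorted(input_dir.glob("*.txt")):
        if txt_path.stem.endswith("_llm"):
            continue  # already an LLM-processed file, not a body-text input
        if not llm_output_path(txt_path, output_dir).exists():
            pending.append(txt_path)
    return pending
```

For example, `pending_inputs(Path("data/inputs"), Path("data/inputs"))` would list the papers that have not been processed yet under this convention.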

## Synthesis Procedure Extraction

Use an LLM to extract material synthesis procedures from the relevant text file (`<Paper's DOI>_llm.txt`) and save the results as a structured JSON file (`<Paper's DOI>.json`).

```bash
export OPENAI_API_KEY="your-api-key"

python src/inference/synthesis_procedure_extraction.py \
  --input-dir "data/inputs" \
  --output-dir "data/outputs" \
  --model "o4-mini" \
  --template-path "src/prompts/synthesis_procedure_extraction.txt" \
  --example-path "data/few-shot/10.1002_advs.201901598.txt"
```

If `--example-path` is omitted, the extraction runs in zero-shot mode; otherwise the given file is used as a one-shot example.

**Arguments:**

| Argument        | Required | Default   | Description                                      |
|-----------------|----------|-----------|--------------------------------------------------|
| --input-dir     | Yes      |           | Directory containing synthesis-related TXT files |
| --output-dir    | Yes      |           | Directory to save extracted JSON outputs         |
| --model         | No       | `o4-mini` | OpenAI model name                                |
| --template-path | Yes      |           | Prompt template file path                        |
| --example-path  | No       | `None`    | Path to one-shot example text file               |
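
To compare zero-shot and one-shot runs from a script, the command line can be assembled programmatically. The sketch below uses only the flags listed in the table above; the function names `build_extraction_command` and `run_extraction` are hypothetical, and the script is simply invoked through `subprocess` rather than imported.

```python
import subprocess
import sys
from typing import Optional


def build_extraction_command(
    input_dir: str,
    output_dir: str,
    template_path: str,
    model: str = "o4-mini",
    example_path: Optional[str] = None,
) -> list[str]:
    """Assemble the CLI call; omit --example-path for zero-shot mode."""
    cmd = [
        sys.executable,
        "src/inference/synthesis_procedure_extraction.py",
        "--input-dir", input_dir,
        "--output-dir", output_dir,
        "--model", model,
        "--template-path", template_path,
    ]
    if example_path is not None:
        cmd += ["--example-path", example_path]  # one-shot mode
    return cmd


def run_extraction(cmd: list[str]) -> None:
    """Run the assembled command; assumes OPENAI_API_KEY is already exported."""
    subprocess.run(cmd, check=True)
```

Calling `build_extraction_command("data/inputs", "data/outputs", "src/prompts/synthesis_procedure_extraction.txt")` yields the zero-shot command, while passing `example_path="data/few-shot/10.1002_advs.201901598.txt"` reproduces the one-shot call shown above.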

## Extraction Performance Evaluation

After generating the JSON outputs, you can evaluate the LLM’s extraction performance by comparing the predictions against the ground-truth data:

```bash
python src/evaluation/evaluate.py \
  --pred-path "data/outputs" \
  --true-path "data/ground-truth" \
  --verbose
```

**Arguments:**

| Argument    | Required | Default | Description                                      |
|-------------|----------|---------|--------------------------------------------------|
| --pred-path | Yes      |         | Path to the LLM-generated JSON file or directory |
| --true-path | Yes      |         | Path to the ground truth file or directory       |
| --verbose   | No       |         | Enable verbose output                            |
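
Before running the evaluation on directories, it may be useful to check which predictions have a ground-truth counterpart. The sketch below assumes that prediction and ground-truth files share the same file name (`<Paper's DOI>.json`), which this README does not state explicitly; `json_names` and `split_by_ground_truth` are hypothetical helpers and do not reflect how `evaluate.py` pairs files internally.

```python
from pathlib import Path


def json_names(directory: Path) -> set[str]:
    """Collect the names of all JSON files directly inside a directory."""
    return {path.name for path in directory.glob("*.json")}


def split_by_ground_truth(
    pred_names: set[str], true_names: set[str]
) -> dict[str, list[str]]:
    """Group file names by whether both sides are present (assumed same names)."""
    return {
        "paired": sorted(pred_names & true_names),
        "missing_prediction": sorted(true_names - pred_names),
        "missing_ground_truth": sorted(pred_names - true_names),
    }
```

For example, `split_by_ground_truth(json_names(Path("data/outputs")), json_names(Path("data/ground-truth")))` reports the files that can be compared and those missing on either side.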

## JSON Output Aggregation

Merge the individual JSON files (`<Paper's DOI>.json`) into a single JSONL file, where each line corresponds to one paper’s structured data, for easier downstream processing:

```bash
python src/processing/jsons_to_jsonl.py \
  --input-dir "data/outputs" \
  --true-path "data/ground-truth" \
  --output "data/outputs/MatPROV.jsonl"
```

**Arguments:**

| Argument    | Required | Default | Description                                   |
|-------------|----------|---------|-----------------------------------------------|
| --input-dir | Yes      |         | Directory containing LLM-generated JSON files |
| --true-path | Yes      |         | Directory containing ground truth JSON files  |
| --output    | Yes      |         | Output JSONL file path                        |
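
Since each line of the JSONL file holds one paper's data, the file can be read back one record at a time. The following is a minimal sketch using only the standard library; `parse_jsonl_lines` and `iter_jsonl` are hypothetical helper names, and no assumption is made about the fields inside each record.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Iterator


def parse_jsonl_lines(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON") from exc


def iter_jsonl(path: Path) -> Iterator[Any]:
    """Stream records from a JSONL file without loading it all at once."""
    with path.open(encoding="utf-8") as handle:
        yield from parse_jsonl_lines(handle)


# Usage (path taken from the command above):
# records = list(iter_jsonl(Path("data/outputs/MatPROV.jsonl")))
```

Streaming line by line keeps memory use flat, which matters once the aggregated file covers many papers.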