# MatPROV

**MatPROV** is a dataset of materials synthesis procedures extracted from scientific papers using large language models (LLMs) and represented in [PROV-DM](https://www.w3.org/TR/prov-dm/)–compliant structures.

---

## Files

```
MatPROV/
├── MatPROV.jsonl   # Main dataset (2,367 synthesis procedures)
├── ground-truth/   # Expert-annotated ground truth
│   └── <DOI>.json
├── few-shot/       # Prompt examples used for synthesis procedure extraction
│   └── <DOI>.txt
└── doi_status.csv  # Status of each paper DOI across the pipeline
```

*Note: In file names under `ground-truth/` and `few-shot/`, forward slashes (`/`) in DOIs are replaced with underscores (`_`).*  
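
The naming rule above can be sketched in Python as follows. The helper names `doi_to_filename` and `ground_truth_path` are illustrative and not part of MatPROV. The DOI is borrowed from the example record in this README purely as a sample string; whether it has a ground-truth file is not stated here.

```python
from pathlib import Path


def doi_to_filename(doi: str, suffix: str) -> str:
    """Map a DOI to a file name by replacing '/' with '_' (hypothetical helper)."""
    return doi.replace("/", "_") + suffix


def ground_truth_path(root: str, doi: str) -> Path:
    """Build the expected path of a ground-truth file under the dataset root."""
    return Path(root) / "ground-truth" / doi_to_filename(doi, ".json")


print(doi_to_filename("10.1002/advs.201600035", ".json"))
# 10.1002_advs.201600035.json
```

The mapping is only safe in this direction: DOIs may themselves contain underscores, so a file name cannot always be converted back to a DOI unambiguously.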

---

## Data format

The main dataset file is `MatPROV.jsonl`, where each line is a JSON record for one extracted synthesis procedure; a paper with several procedures appears on several lines.
Each record contains:

- `doi`: DOI of the source paper  
- `label`: Identifier for the extracted synthesis procedure, encoding the material's chemical composition and key synthesis characteristics (e.g., `CuGaTe2_ball-milling`)  
- `prov_jsonld`: A PROV-JSONLD structure describing the synthesis procedure  
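
A minimal sketch for reading the file with the Python standard library is shown below. It assumes only the three fields listed above; the function names and the default path are illustrative, not part of MatPROV.

```python
import json
from collections import defaultdict
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def load_matprov(path: str = "MatPROV.jsonl") -> list[dict]:
    """Read every record from the dataset file (the path is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return list(parse_jsonl(handle))


def labels_by_doi(records: Iterable[dict]) -> dict[str, list[str]]:
    """Group the `label` values by the `doi` of their source paper."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for record in records:
        grouped[record["doi"]].append(record["label"])
    return dict(grouped)
```

Calling `labels_by_doi(load_matprov())` from the dataset root would then give, for each DOI, the labels recorded for that paper.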

### Example

```json
{
  "doi": "10.1002/advs.201600035",
  "label": "Fe1+xNb0.75Ti0.25Sb_composition variation",
  "prov_jsonld": {
    "@context": [
      {"xsd": "http://www.w3.org/2001/XMLSchema#", "prov": "http://www.w3.org/ns/prov#"},
      "https://openprovenance.org/prov-jsonld/context.jsonld",
      "URL of MatPROV's context schema omitted for double-blind review"
    ],
    "@graph": [
      {
        "@type": "Entity",
        "@id": "e1",
        "label": [{"@value": "Fe", "@language": "EN"}],
        "type": [{"@value": "material"}],
        "matprov:purity": [{"@value": "99.97%", "@type": "xsd:string"}]
      }
      ...
    ]
  }
}
```
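
The nodes in `@graph` can be inspected with plain dictionary access, as in the sketch below. It relies only on the keys visible in the example above (`@type`, `label`, `matprov:purity`, `@value`); the helper names are hypothetical, and other node types or keys in the full dataset may need extra handling.

```python
from collections import Counter


def node_types(node: dict) -> list[str]:
    """Return a node's '@type' as a list (JSON-LD allows a string or a list)."""
    value = node.get("@type", [])
    return [value] if isinstance(value, str) else list(value)


def count_node_types(prov_jsonld: dict) -> Counter:
    """Count how many nodes of each '@type' appear in '@graph'."""
    graph = prov_jsonld.get("@graph", [])
    return Counter(t for node in graph for t in node_types(node))


def plain_values(node: dict, key: str) -> list[str]:
    """Collect the '@value' entries stored under a key such as 'label'."""
    return [
        item["@value"]
        for item in node.get(key, [])
        if isinstance(item, dict) and "@value" in item
    ]


# The single node shown in the example record above.
example = {
    "@graph": [
        {
            "@type": "Entity",
            "@id": "e1",
            "label": [{"@value": "Fe", "@language": "EN"}],
            "type": [{"@value": "material"}],
            "matprov:purity": [{"@value": "99.97%", "@type": "xsd:string"}],
        }
    ]
}

print(count_node_types(example))                    # Counter({'Entity': 1})
print(plain_values(example["@graph"][0], "label"))  # ['Fe']
```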

## Dataset construction summary

- Source papers collected: 1648
- **Relevant Text Extraction**
  - 32 papers contained no synthesis-related text
  - → 1616 papers remained
- **Synthesis Procedure Extraction**
  - 48 papers contained no synthesis procedure
  - → 1568 papers remained (final dataset)

The DOIs of these 1568 papers and their extracted data are included in `MatPROV.jsonl`.
For details on the filtering status of each DOI, see `doi_status.csv`.
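
The counts above can be checked with a few lines of Python, and the same script can tally `doi_status.csv`. The column names of that file are not documented in this README, so the `column` argument is an assumption to be filled in after inspecting the CSV header; `tally` and `tally_csv` are illustrative helpers.

```python
import csv
from collections import Counter
from typing import Iterable, Mapping

# Counts taken from the construction summary above.
SOURCE_PAPERS = 1648
NO_SYNTHESIS_TEXT = 32
NO_SYNTHESIS_PROCEDURE = 48

after_text_extraction = SOURCE_PAPERS - NO_SYNTHESIS_TEXT          # 1616
final_papers = after_text_extraction - NO_SYNTHESIS_PROCEDURE      # 1568


def tally(rows: Iterable[Mapping[str, str]], column: str) -> Counter:
    """Count how often each value occurs in the given column."""
    return Counter(row[column] for row in rows)


def tally_csv(path: str, column: str) -> Counter:
    """Tally one column of a CSV file; `column` must match the file's header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return tally(csv.DictReader(handle), column)
```

If the status column turns out to hold one value per pipeline outcome, the tally should reproduce the 32, 48, and 1568 figures listed above.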

## Ground-truth annotations

- A subset of papers was manually annotated by a single domain expert.
- Files are stored in `ground-truth/` and named as `<DOI>.json`.

## Few-shot examples

- The prompt examples used for LLM-based extraction are stored in `few-shot/`.
- Each file is named `<DOI>.txt`.

## Citation

If you use MatPROV, please cite:

```
TBD
```
