# OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

This codebase contains the code for the following stages of the OpenWebMath pipeline:

1. **Prefiltering**: Fast filters to remove most of the non-mathematical web documents from Common Crawl.
2. **Text Extraction**: Extracting text and LaTeX from HTML documents.
3. **Language Identification**: Identifying the language of the extracted text and filtering out non-English documents.
4. **MathScore Filtering**: Filtering out documents with low *MathScores*.
5. **Perplexity Filtering**: Filtering out documents with high perplexity.
6. **Deduplication**: Removing duplicate documents.
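
The filtering stages above can be pictured as a chain of per-document predicates, with the cheapest checks first so that most documents are rejected early. The following is only a minimal sketch of that idea, not the code in this repository: the field names (`html`, `lang`, `math_score`, `perplexity`), the threshold values, and every function name are hypothetical, and text extraction and deduplication are left out because they are not simple keep-or-drop tests.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Document = Dict[str, object]
Predicate = Callable[[Document], bool]

# Hypothetical thresholds, chosen only for illustration.
MIN_MATH_SCORE = 0.5
MAX_PERPLEXITY = 1000.0


def looks_mathematical(doc: Document) -> bool:
    """Placeholder prefilter: a cheap string check for LaTeX or MathML."""
    html = str(doc.get("html", ""))
    return "\\" in html or "<math" in html


def is_english(doc: Document) -> bool:
    """Placeholder language check on a precomputed (assumed) `lang` field."""
    return doc.get("lang") == "en"


def has_high_math_score(doc: Document) -> bool:
    """Keep documents whose (assumed) `math_score` meets the threshold."""
    return float(doc.get("math_score", 0.0)) >= MIN_MATH_SCORE


def has_low_perplexity(doc: Document) -> bool:
    """Keep documents whose (assumed) `perplexity` is at most the threshold."""
    return float(doc.get("perplexity", float("inf"))) <= MAX_PERPLEXITY


# Ordered as in the stage list: cheap checks run before expensive ones.
FILTERS: List[Tuple[str, Predicate]] = [
    ("prefilter", looks_mathematical),
    ("language", is_english),
    ("math_score", has_high_math_score),
    ("perplexity", has_low_perplexity),
]


def run_filters(
    docs: Iterable[Document],
    filters: List[Tuple[str, Predicate]] = FILTERS,
) -> Iterator[Document]:
    """Yield documents that pass every filter; all() stops at the first failure."""
    for doc in docs:
        if all(keep(doc) for _, keep in filters):
            yield doc
```

Because `all()` short-circuits, a document rejected by the prefilter never reaches the later, costlier checks.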

## Code Structure

The code is organized into three separate folders:

1. `text_extraction` contains the code for extracting text and LaTeX from HTML documents.
2. `extract_from_cc` contains the code for extracting the dataset from Common Crawl, including prefiltering, language identification, MathScore filtering, and perplexity filtering.
3. `filtering` contains many of the manual filtering steps, such as filtering out blacklisted domains.

To run the `extract_from_cc` code, either run it on Apache Spark or run `extract_from_warc.py` manually, passing a WARC file as an argument.

For deduplication, please use the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library.

Finally, for filtering, `filter.py` contains the code to load a Hugging Face dataset and filter it based on our heuristics.
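
As an illustration of one such heuristic, the sketch below drops records whose URL belongs to a blacklisted domain. It is a hedged example rather than the logic in `filter.py`: the `url` field, the example domains, and all function names are assumptions made for this sketch, and it works on plain dictionaries instead of a Hugging Face dataset.

```python
from typing import Dict, Iterable, List, Set
from urllib.parse import urlparse

# Hypothetical blacklist; the real list in this repository may differ.
BLACKLISTED_DOMAINS: Set[str] = {"example.com", "spam.example.org"}


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_blacklisted(url: str, blacklist: Set[str] = BLACKLISTED_DOMAINS) -> bool:
    """True if the URL's host is a blacklisted domain or one of its subdomains."""
    domain = domain_of(url)
    return any(domain == b or domain.endswith("." + b) for b in blacklist)


def filter_records(
    records: Iterable[Dict[str, str]],
    blacklist: Set[str] = BLACKLISTED_DOMAINS,
) -> List[Dict[str, str]]:
    """Keep only records whose (assumed) `url` field is not blacklisted."""
    return [r for r in records if not is_blacklisted(r["url"], blacklist)]
```

Matching on `"." + b` rather than a bare suffix avoids treating an unrelated host such as `notexample.com` as a subdomain of `example.com`.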

## Artifact Access During Review

Due to the double-blind nature of the review period and the large file sizes involved, we are unable to provide access to the entire dataset, *MathScore* model, KenLM models, or any of the Pythia models during the review period. We do, however, include a small sample of 10,000 documents from the dataset in `sample_dataset.jsonl`.
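
One way to inspect the sample is to read it line by line as JSON. The helper below is a minimal sketch that assumes only that each non-empty line of `sample_dataset.jsonl` is a JSON object; it makes no assumption about which fields the records contain, and the function names are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, Optional


def read_jsonl(path: str, limit: Optional[int] = None) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line, up to `limit` records."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if limit is not None and count >= limit:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines
            yield json.loads(line)
            count += 1


def count_fields(records: Iterable[Dict]) -> Counter:
    """Count how many records contain each top-level key."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record.keys())
    return counts
```

For example, `count_fields(read_jsonl("sample_dataset.jsonl", limit=100))` reports which top-level fields appear in the first 100 records.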