{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "40647db0-0509-4550-8b7b-bee114ca270e",
   "metadata": {},
   "source": [
    "# GneissWeb Recipe\n",
    "\n",
    "#### To make GneissWeb reproducible, this notebook presents the GneissWeb recipe and applies its components in sequence to reproduce the GneissWeb processing pipeline. \n",
    "<br>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a46fe08-3f61-4ed4-95fc-afeafefd20e1",
   "metadata": {},
   "source": [
    "\n",
    "### **An Overview of the GneissWeb Recipe**\n",
    "##### The GneissWeb dataset was obtained by applying the following processing steps:\n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Step 1: Exact substring deduplication at line level \n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Step 2: Quality annotators: \n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Step 2.1: Custom built fastText Quality Classifier \n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Step 2.2: Custom built fastText Category Classifiers \n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Step 2.3: Custom built Readability Score Quality Annotator \n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&emsp;&emsp;&nbsp; - Step 2.4: Custom built Extreme-Tokenized-Documents Quality Annotator \n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Step 3: Category-aware Ensemble Quality Filter\n",
    "\n",
    "#####  These steps were applied in the order listed above.\n",
    "\n",
    "<br>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90ed3dab-30af-41c6-a5c3-f665a7c49243",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "#### <b>1- The repetition removal transform requires Rust to be installed on the machine. You can install Rust by following the instructions [here](https://www.rust-lang.org/tools/install). </b>\n",
    "\n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Note: If Rust is not on your \\$PATH, add its installation location so the transform can run. You can use the <i><b>!whereis cargo</b></i> command to find where Rust is installed on your machine, and add that location, up to and including /bin, to the path.\n",
    "\n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Example: whereis cargo produces: cargo: /Users/USERNAME/.cargo/bin/cargo\n",
    "\n",
    "##### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; In that case, set \\$PATH to include /Users/USERNAME/.cargo/bin/\n",
    "\n",
    "#### <b>2- Set up a virtual environment for running the notebook. The instructions below provide one example using Python venv. </b>\n",
    "\n",
    "```\n",
    "python -m venv venv\n",
    "source venv/bin/activate\n",
    "pip install jupyterlab\n",
    "venv/bin/jupyter lab\n",
    "```\n",
    "\n",
    "#### <b>3- The GneissWeb classifiers use models stored on HuggingFace. Users need to provide their own read-access token for retrieving the models. This notebook assumes the token is stored in an environment variable named HF_TOKEN. </b>\n",
    "(Model paths will be included in the final version.)\n",
    "```\n",
    "export HF_TOKEN='hf_xxx'\n",
    "```\n",
    "\n",
    "#### Note about running the Notebook: This notebook involves complex operations that may take between 1 and 2 hours to complete, depending on the size of the input file(s) selected for downloading from HuggingFace and on the available memory and CPU of the machine. \n",
    "Install data-prep-kit.  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e980653-30c9-4e2a-81de-b1da65c01f50",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Read the Hugging Face access token exported as HF_TOKEN (see Prerequisites).\n",
    "credential = os.environ['HF_TOKEN']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf2ee4b2-d3ca-4d4f-9a37-e0c029a241c2",
   "metadata": {},
   "source": [
    "### Sample Data for testing: Load and split large parquet file"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a23bd14-cfb5-4f8a-8819-ce446eecd15b",
   "metadata": {},
   "source": [
    "#### Download a parquet file from HF using the HF download API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3a74dff-faee-44ae-9f0a-d7dfd8be2ccc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import hf_hub_download\n",
    "import pandas as pd\n",
    "\n",
    "REPO_ID = \"HuggingFaceFW/fineweb\"\n",
    "FILENAME = \"data/CC-MAIN-2013-20/000_00000.parquet\"\n",
    "\n",
    "file1 = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type=\"dataset\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04a82609-58fc-41d9-b434-10954772851d",
   "metadata": {},
   "source": [
    "#### Resize the file to a smaller size for testing purposes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a60cd92e-2931-472f-8ee1-491523bfdc1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_resize.runtime import Resize\n",
    "Resize(input_folder= os.path.dirname(file1),\n",
    "        output_folder= \"input\",\n",
    "        resize_max_mbytes_per_table= 200).transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "594ab090-f34d-4707-8a16-f91bfd7e36c9",
   "metadata": {},
   "source": [
    "## Step 1. Exact substring deduplication\n",
    "##### This component applies exact substring deduplication to remove any substring of predetermined length that repeats more than once within a single parquet file. It adapts the implementation from [deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbc63703-f3f7-4c6e-b325-86f2679a5bdb",
   "metadata": {},
   "source": [
    "#### Input Parameters\n",
    "\n",
    "The transform can be initialized with the following parameters:\n",
    "\n",
    "| Parameter                          | Default                            | Description                                       |\n",
    "|------------------------------------|------------------------------------|---------------------------------------------------|\n",
    "| `rep_removal_contents_column_name` | `contents`                         | Name of the column holding the document contents  |\n",
    "| `rep_removal_length_thresh`         | `50`                               | Length threshold for processing                   |\n",
    "| `rep_removal_frequency_threshold`  | `1`                                | Frequency threshold for processing                |\n",
    "| `rep_removal_retain_first_copy`    | `True`                             | Boolean value for whether to retain first copy    |\n",
    "| `rep_removal_num_threads`          | `psutil.cpu_count(logical=False)`  | Number of threads to use for processing           |\n",
    "\n",
    "\n",
    "#### Output Format\n",
    "\n",
    "The output format will be a new parquet file with the repeated sequence(s) removed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ab0fd19-dba5-4934-b510-c8c964b6ca00",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "from dpk_rep_removal.runtime import RepRemoval\n",
    "\n",
    "RepRemoval(input_folder= \"input\",\n",
    "            output_folder= \"tmp/rep_removal\",\n",
    "            rep_removal_contents_column_name='text', \n",
    "            rep_removal_num_threads='10',\n",
    "            ).transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c5ff67d-7ce9-4e7c-b283-bb75739ea876",
   "metadata": {},
   "source": [
    "## Step 2. Annotation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10d1a2b7-bc06-4dba-ae7c-764444f79da4",
   "metadata": {},
   "source": [
    "### Step 2.1. fastText Quality Annotator\n",
    "##### This transform annotates each document with two fastText quality classifiers: \n",
    "##### (i) GneissWeb.Quality_annotator classifier trained on a mix of high-quality synthetic data and data annotated by an LLM for high educational value \n",
    "##### (ii) the fastText classifier from DCLM\n",
    "\n",
    "These fastText models are used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e81af0ca-3afa-46f1-ad72-ea913c842ea9",
   "metadata": {},
   "source": [
    "#### Input Parameters\n",
    "\n",
    "The transform can be initialized with the following parameters:\n",
    "\n",
    "| Parameter                             | Default                            | Description                                       |\n",
    "|------------------------------------   |------------------------------------|---------------------------------------------------|\n",
    "| `gcls_model_credential`               | unset                              | Credential used to download the model, i.e. a Hugging Face token. [Guide to get huggingface token](https://huggingface.co/docs/hub/security-tokens)  |\n",
    "| `gcls_model_file_name`                | unset                              | Filename of the model to download, e.g. fasttext_gneissweb_quality_annotator.bin |\n",
    "| `gcls_model_url`                      | unset                              | Location of the model. For fastText, this is the repository name of the model, e.g. HF_url/GneissWeb.Quality_annotator |\n",
    "| `gcls_content_column_name`            | contents                           | Name of the column containing documents   |\n",
    "| `gcls_output_label_column_name`       | label                              | Name of the output column to hold predicted classes |\n",
    "| `gcls_output_score_column_name`       | score                              | Name of the output column to hold the prediction score |\n",
    "\n",
    "\n",
    "#### Output Format\n",
    "\n",
    "The output format will be a new parquet file with the label and score columns added.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df8ef14a-6970-4eb1-a6ac-b2f92259d894",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time \n",
    "\n",
    "from dpk_gneissweb_classification.transform_python import Classification\n",
    "\n",
    "Classification(input_folder= \"tmp/rep_removal\",\n",
    "        output_folder= \"tmp/fastText/quality_DCLM\",\n",
    "        gcls_model_credential= credential,\n",
    "        gcls_model_file_name= [\"fasttext_gneissweb_quality_annotator.bin\",\"openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin\"],\n",
    "        gcls_model_url= [\"HF_url/GneissWeb.Quality_annotator\",\"mlfoundations/fasttext-oh-eli5\"],\n",
    "        gcls_n_processes=4,\n",
    "        gcls_output_label_column_name= [\"cosmo_fastText_label\",\"dclm_fastText_label\"],\n",
    "        gcls_output_score_column_name= [\"cosmo_fastText_score\",\"dclm_fastText_score\"],\n",
    "        gcls_content_column_name= \"text\").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b3ad219-944d-4592-81b1-84a1d26ae99c",
   "metadata": {},
   "source": [
    "### Step 2.2. Document Category Classifiers\n",
    "##### This step annotates each document using four fastText category classifiers:\n",
    "##### &emsp;&emsp;  - GneissWeb.Med_classifier\n",
    "##### &emsp;&emsp;  - GneissWeb.Edu_classifier\n",
    "##### &emsp;&emsp;  - GneissWeb.Tech_classifier\n",
    "##### &emsp;&emsp;  - GneissWeb.Sci_classifier\n",
    "\n",
    "These fastText models are used as part of the ensemble filter in GneissWeb to leverage the category annotations in category-aware readability score quality filtering and extreme-tokenized quality filtering.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4737ffd4-48f2-4106-80a9-72ef1efb21e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time \n",
    "\n",
    "Classification(input_folder= \"tmp/fastText/quality_DCLM\",\n",
    "        output_folder= \"tmp/fastText/category_classifier\",\n",
    "        gcls_model_credential= credential,\n",
    "        gcls_model_file_name= [\"fasttext_medical.bin\",\"fasttext_education.bin\",\"fasttext_technology_computing.bin\", \"fasttext_science.bin\"],\n",
    "        gcls_model_url= [\"HF_url/GneissWeb.Med_classifier\",\"HF_url/GneissWeb.Edu_classifier\",\"HF_url/GneissWeb.Tech_classifier\", \"HF_url/GneissWeb.Sci_classifier\"],\n",
    "        gcls_n_processes=4,\n",
    "        gcls_output_label_column_name= [\"medical_label\",\"education_label\",\"technology_computing_label\",\"science_label\"],\n",
    "        gcls_output_score_column_name= [\"medical_score\",\"education_score\",\"technology_computing_score\",\"science_score\"],\n",
    "        gcls_content_column_name= \"text\").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e29c9a90-f0ba-4849-a45c-0a582f23ac2d",
   "metadata": {},
   "source": [
    "### Step 2.3. Readability Score Quality Annotator\n",
    "\n",
    "Readability scores are formulas based on text statistics (such as sentence length, average number of words, number of syllables etc.) designed to assess how easily the text can be read and understood.\n",
    "\n",
    "This transform calculates the [McAlpine-EFLAW](https://www.angelfire.com/nd/nirmaldasan/journalismonline/fpetge.html) readability score for each document in the output parquet file from the previous step and adds it to the data as a new column.\n",
    "\n",
    "The McAlpine-EFLAW readability score of a document is the number of words in the document plus the number of mini-words (words of ≤ 3 characters), divided by the number of sentences. A lower score means the document is easier to understand for a reader with English as a foreign language. "
   ]
  },
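  {
   "cell_type": "markdown",
   "id": "3f1c2a7e-5b8d-4c6a-9e21-7d0a4b6c8e15",
   "metadata": {},
   "source": [
    "As a sketch of the definition above, write $W$ for the number of words, $M$ for the number of mini-words and $S$ for the number of sentences (symbols introduced here for illustration only). The score can then be written as:\n",
    "\n",
    "$$\\text{EFLAW} = \\frac{W + M}{S}$$\n",
    "\n",
    "For example, a hypothetical document with $W = 100$, $M = 40$ and $S = 5$ would score $(100 + 40) / 5 = 28$. How words, mini-words and sentences are counted is left to the underlying implementation.\n"
   ]
  },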
  {
   "cell_type": "markdown",
   "id": "9782a201-8adb-4a79-acb7-98b573af130d",
   "metadata": {},
   "source": [
    "#### Input Parameters\n",
    "\n",
    "The transform can be initialized with the following parameters:\n",
    "\n",
    "| Parameter                          | Default                            | Description                                       |\n",
    "|------------------------------------|------------------------------------|---------------------------------------------------|\n",
    "| `readability_contents_column_name` | `text`                             | specifies the name of the column holding the document text.  |\n",
    "| `readability_score_list`           | `mcalpine_eflaw_textstat`          | list of readability scores to be computed by the transform   |\n",
    "\n",
    "\n",
    "\n",
    "#### Output Format\n",
    "\n",
    "The output format will be a new parquet file with the Readability scores added.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96a2e3cf-7c3d-4c5b-bb28-bde0bddb7390",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "from dpk_readability.runtime import Readability\n",
    "\n",
    "Readability(\n",
    "    input_folder=\"tmp/fastText/category_classifier\",\n",
    "    output_folder=\"tmp/readabilty\",\n",
    "    readability_contents_column_name=\"text\",\n",
    "    readability_score_list=[\"mcalpine_eflaw_textstat\"],\n",
    ").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e09597a-d8d9-4e88-b1ef-4502f54447de",
   "metadata": {},
   "source": [
    "### Step 2.4. Extreme-Tokenized-Documents Quality Annotator\n",
    "##### This annotator retrieves the tokens generated for a set of documents. It then calculates, for each document, its size and its total number of characters. The number of tokens is divided by the size and by the number of characters, and the resulting values are stored in two columns (tokens_per_doc_size and tokens_per_doc_num_chars).\n",
    "\n",
    "##### The annotator transform annotates the input table with 5 columns:\n",
    "\n",
    "##### &emsp;&emsp;1. doc_num_tokens - number of tokens for each document\n",
    "##### &emsp;&emsp;2. doc_size_kbs - document size in kb\n",
    "##### &emsp;&emsp;3. doc_num_chars - number of characters in the document\n",
    "##### &emsp;&emsp;4. tokens_per_doc_size - ratio between number of tokens and document size\n",
    "##### &emsp;&emsp;5. tokens_per_doc_num_chars - ratio between number of tokens and number of characters in document\n",
    "##### Documents with extremely high or low number of tokens per character (or tokens per byte) are identified as extreme-tokenized documents and can be excluded in the filtering step.\n"
   ]
  },
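  {
   "cell_type": "markdown",
   "id": "7a9d4e10-2c3b-4f58-8b6e-1e5f0c9a2d47",
   "metadata": {},
   "source": [
    "The two ratio columns described above can be sketched in symbols. The symbols are introduced here only for illustration: $T$ is the number of tokens (`doc_num_tokens`), $B$ is the document size (`doc_size_kbs`) and $C$ is the number of characters (`doc_num_chars`).\n",
    "\n",
    "$$r_{\\text{size}} = \\frac{T}{B}, \\qquad r_{\\text{chars}} = \\frac{T}{C}$$\n",
    "\n",
    "Here $r_{\\text{size}}$ corresponds to `tokens_per_doc_size` and $r_{\\text{chars}}$ to `tokens_per_doc_num_chars`. For example, a hypothetical document with $T = 250$ tokens and $C = 1000$ characters has $r_{\\text{chars}} = 250 / 1000 = 0.25$.\n"
   ]
  },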
  {
   "cell_type": "markdown",
   "id": "6e805d12-418d-4e00-beed-8b570b9f2801",
   "metadata": {},
   "source": [
    "#### Step 2.4.1 Tokenization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "734018bf-8330-4fe3-870f-b14c87da8b91",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "from dpk_tokenization2arrow.runtime import Tokenization2Arrow\n",
    "\n",
    "Tokenization2Arrow(\n",
    "        input_folder= \"tmp/readabilty\",\n",
    "        output_folder= \"tmp/arrows\",\n",
    "        tkn_tokenizer=  \"bigcode/starcoder\",\n",
    "        tkn_doc_id_column= \"id\",\n",
    "        tkn_doc_content_column= \"text\",\n",
    "        tkn_chunk_size= 20_000).transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "797926be-09b3-4d39-9c8f-aba4f7bef4e9",
   "metadata": {},
   "source": [
    "#### Step 2.4.2 Annotation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b69c6694-8143-4f68-986a-8f47dc2949c1",
   "metadata": {},
   "source": [
    "#### Input Parameters\n",
    "\n",
    "The transform can be initialized with the following parameters:\n",
    "\n",
    "| Parameter                          | Default                            | Description                                       |\n",
    "|------------------------------------|------------------------------------|---------------------------------------------------|\n",
    "| `et_contents_column_name`          | `text`                             | specifies the name of the column holding the document text.  |\n",
    "| `et_arrow_path`                    | `unset`                            | location of the folder containing the arrow (tokenization) files.   |\n",
    "\n",
    "\n",
    "\n",
    "#### Output Format\n",
    "\n",
    "The output format will be a new parquet file with 5 columns added."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02e23cb3-a184-409d-b6cf-5649f5e35d21",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "from dpk_extreme_tokenized.runtime import ExtremeTokenized\n",
    "\n",
    "ExtremeTokenized(\n",
    "    input_folder=\"tmp/readabilty\",\n",
    "    output_folder=\"tmp/extreme_tokenized\",\n",
    "    et_contents_column_name=\"text\",\n",
    "    et_arrow_path=\"tmp/arrows\",\n",
    ").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eadce7d7-1e39-4c1b-8f0b-39b115d8223a",
   "metadata": {},
   "source": [
    "### Step 3. Category-aware Ensemble Quality Filter\n",
    "\n",
    "##### GneissWeb ensemble filtering rule: A document is retained if either the fastText combination and the category-aware readability score filter agree, or the fastText combination and the category-aware extreme-tokenized filter agree. Here the fastText combination is the logical OR of the fastText classifiers, i.e., either of the fastText classifiers agrees. Please refer to the GneissWeb paper for more details.\n",
    "\n"
   ]
  },
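  {
   "cell_type": "markdown",
   "id": "b2e6c8f4-91a7-4d35-a0c2-6f3e8d1b5a90",
   "metadata": {},
   "source": [
    "The rule above can be sketched in Boolean form. The symbols are introduced here for illustration only: $F_1$ and $F_2$ mean that the respective fastText quality classifier accepts the document, $R$ means that the category-aware readability score filter accepts it, and $E$ means that the category-aware extreme-tokenized filter accepts it.\n",
    "\n",
    "$$\\text{retain} = \\big((F_1 \\lor F_2) \\land R\\big) \\lor \\big((F_1 \\lor F_2) \\land E\\big)$$\n",
    "\n",
    "By distributivity this is equivalent to $(F_1 \\lor F_2) \\land (R \\lor E)$: a document is kept when at least one fastText classifier and at least one of the two category-aware filters accept it.\n"
   ]
  },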
  {
   "cell_type": "markdown",
   "id": "cefffcae-2bcd-4342-817b-b8b34fd5c9be",
   "metadata": {},
   "source": [
    "\n",
    "##### This filtering step filters out low-quality documents from the input data using multiple quality annotators and by leveraging the category information of documents. \n",
    "\n",
    "#### Input Parameters\n",
    "\n",
    "The transform can be initialized with the following parameters:\n",
    "\n",
    "| Parameter                          | Default                            | Description                                       |\n",
    "|------------------------------------|------------------------------------|---------------------------------------------------|\n",
    "| `filter_criteria_list`             | `[]`                               | specifies the list of row filter criteria (in SQL WHERE clause format). Each filter criterion is a string. The default value of this parameter is [] (an empty list, meaning that all the rows in the input table will be kept).  |\n",
    "| `filter_logical_operator`          | `AND`                              | specifies the logical operator that joins filter criteria (AND or OR).   |\n",
    "| `filter_columns_to_drop`           | `[]`                              | the list with the names of the columns to drop after row filtering is complete. The default value of this parameter is [] (an empty list, meaning that all the columns in the input table will be kept).   |\n",
    "\n",
    "\n",
    "\n",
    "#### Output Format\n",
    "\n",
    "The output format will be a new parquet file with the rows that do not meet a specific set of criteria removed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c2030da-40a7-4eca-8841-3b5e59165694",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "from dpk_filter.transform_python import Filter\n",
    "\n",
    "Filter(input_folder= \"tmp/extreme_tokenized\",\n",
    "        output_folder= \"output\",\n",
    "        filter_criteria_list= \n",
    "        ['((\"dclm_fastText_score\" > 0.002 OR \"cosmo_fastText_score\" > 0.03)) AND (((mcalpine_eflaw_textstat < 70) AND (technology_computing_label IN (\\'technology\\') OR medical_label IN (\\'medical\\') OR education_label IN (\\'education\\') OR science_label IN (\\'science\\'))) OR ((mcalpine_eflaw_textstat < 30) AND (technology_computing_label IN (\\'cc\\') AND medical_label IN (\\'cc\\') AND education_label IN (\\'cc\\') AND science_label IN (\\'cc\\'))))',\n",
    "         '((\"dclm_fastText_score\" > 0.002 OR \"cosmo_fastText_score\" > 0.03)) AND (((tokens_per_doc_num_chars BETWEEN 0.1 AND 0.5) AND (technology_computing_label IN (\\'technology\\') OR medical_label IN (\\'medical\\') OR education_label IN (\\'education\\') OR science_label IN (\\'science\\'))) OR ((tokens_per_doc_num_chars BETWEEN 0.22 AND 0.28) AND (technology_computing_label IN (\\'cc\\') AND medical_label IN (\\'cc\\') AND education_label IN (\\'cc\\') AND science_label IN (\\'cc\\'))))'],\n",
    "        filter_logical_operator= \"OR\").transform()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
