{
    "cells": [
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Synthetic NLP\n",
                "\n",
                "In this tutorial, we will demonstrate how to create a Synthetic dataset, write a synthetic dataset into a streaming format and use the [StreamingDataset][streaming_dataset] class to load the dataset.\n",
                "\n",
                "### Recommended Background\n",
                "\n",
                "This tutorial assumes that you're reasonably familiar with the workings of datasets and dataloaders for training deep learning models.\n",
                "\n",
                "If you're already familiar with streaming's dataset classes ([Dataset][streaming_dataset] and [MDSWriter][streaming_dataset_mds_writer]), that's great. If not, you may want to pause while working through the tutorial and look at the docs referenced along the way.\n",
                "\n",
                "### Tutorial Goals and Concepts Covered\n",
                "\n",
                "The goal of this tutorial is to showcase how to prepare the dataset and use Streaming data loading to iterate and fetch the samples. It will consist of a few steps:\n",
                "\n",
                "1. Generate a synthetic dataset\n",
                "2. Preparing the dataset for streaming\n",
                "3. Streaming the dataset to the local machine\n",
                "4. Iterate through the dataset and fetch the samples\n",
                "\n",
                "Let's get started!\n",
                "\n",
                "[streaming_dataset]: https://streaming.docs.mosaicml.com/en/stable/api_reference/generated/streaming.StreamingDataset.html#streaming.StreamingDataset\n",
                "[streaming_dataset_mds_writer]: https://streaming.docs.mosaicml.com/en/latest/api_reference/generated/streaming.MDSWriter.html"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Setup\n",
                "\n",
                "Let's start by making sure the right packages are installed and imported. We need to install the `mosaicml-streaming` package which installs the sufficient dependencies to run this tutorial."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "%pip install mosaicml-streaming\n",
                "# To install from source instead of the last release, comment the command above and uncomment the following one.\n",
                "# %pip install git+https://github.com/mosaicml/streaming.git\n",
                "\n",
                "# (Optional) To upload a streaming dataset to an AWS S3 bucket\n",
                "%pip install awscli"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import os\n",
                "import shutil\n",
                "from typing import Any, Dict, List, Tuple\n",
                "\n",
                "import numpy as np\n",
                "import torch\n",
                "from torch.utils.data import DataLoader\n",
                "from tqdm import tqdm"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "We'll be using Streaming's `MDSWriter` which writes the dataset in Streaming format and `StreamingDataset` to load the streaming dataset."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from streaming import MDSWriter, StreamingDataset"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Global settings\n",
                "\n",
                "For this tutorial, let's import some of the global setting at the start."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# the location of the \"remote\" streaming dataset (`sds`). \n",
                "# Upload `out_root` to your cloud storage provider of choice. If `out_root` is a cloud provider\n",
                "# path, shard files are automatically uploaded.\n",
                "out_root = \"./sds\"\n",
                "out_train = \"./sds/train\"\n",
                "out_val = \"./sds/val\"\n",
                "\n",
                "# the location to download the streaming dataset during training\n",
                "local = './local'\n",
                "local_train = './local/train'\n",
                "local_val = './local/val'\n",
                "\n",
                "# toggle shuffling in dataloader\n",
                "shuffle_train = True\n",
                "shuffle_val = False\n",
                "\n",
                "# training batch size\n",
                "batch_size = 512"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# upload location for the dataset splits (change this if you want to upload to a different location, for example, AWS S3 bucket location)\n",
                "upload_location = None\n",
                "\n",
                "if upload_location is None:\n",
                "    upload_train_location = None\n",
                "    upload_val_location = None\n",
                "else:\n",
                "    upload_train_location = os.path.join(upload_location, 'train')\n",
                "    upload_val_location = os.path.join(upload_location, 'val')"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Create a Synthetic NLP dataset\n",
                "\n",
                "In this tutorial, we will be creating a synthetic number-saying dataset, i.e. converting a numbers from digits to words, for example, number `123` would spell as `one hundred twenty three`. The numbers are generated sequentially with a random positive/negative prefix sign.\n",
                "\n",
                "Let's import a utility functions to generate those synthetic number-saying dataset."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Word representation of a number\n",
                "ones = ('zero one two three four five six seven eight nine ten eleven twelve thirteen fourteen ' +\n",
                "        'fifteen sixteen seventeen eighteen nineteen').split()\n",
                "\n",
                "tens = 'twenty thirty forty fifty sixty seventy eighty ninety'.split()\n",
                "\n",
                "\n",
                "def say(i: int) -> List[str]:\n",
                "    \"\"\"Get the word form of a number.\n",
                "\n",
                "    Args:\n",
                "        i (int): The number.\n",
                "\n",
                "    Returns:\n",
                "        List[str]: The number in word form.\n",
                "    \"\"\"\n",
                "    if i < 0:\n",
                "        return ['negative'] + say(-i)\n",
                "    elif i <= 19:\n",
                "        return [ones[i]]\n",
                "    elif i < 100:\n",
                "        return [tens[i // 10 - 2]] + ([ones[i % 10]] if i % 10 else [])\n",
                "    elif i < 1_000:\n",
                "        return [ones[i // 100], 'hundred'] + (say(i % 100) if i % 100 else [])\n",
                "    elif i < 1_000_000:\n",
                "        return say(i // 1_000) + ['thousand'] + (say(i % 1_000) if i % 1_000 else [])\n",
                "    elif i < 1_000_000_000:\n",
                "        return say(i // 1_000_000) + ['million'] + (say(i % 1_000_000) if i % 1_000_000 else [])\n",
                "    else:\n",
                "        assert False\n",
                "\n",
                "def get_numbers(num_train: int, num_val: int) -> Tuple[List[int], List[int]]:\n",
                "    \"\"\"Get two non-overlapping splits of a sequential random numbers.\n",
                "\n",
                "    The train sample indices goes from [0, num_train] and val sample indices goes \n",
                "    from [num_train, num_val].\n",
                "\n",
                "    Args:\n",
                "        num_train (int): Number of training samples.\n",
                "        num_val (int): Number of validation samples.\n",
                "\n",
                "    Returns:\n",
                "        Tuple[List[int], List[int]]: The two generated splits.\n",
                "    \"\"\"\n",
                "    total = num_train + num_val\n",
                "    numbers = []\n",
                "    bar = tqdm(total=total, leave=False)\n",
                "    i = 0\n",
                "    while i < total:\n",
                "        was = len(numbers)\n",
                "        sign = (np.random.random() < 0.8) * 2 - 1\n",
                "        numbers.append(sign * i)\n",
                "        bar.update(len(numbers) - was)\n",
                "        i += 1\n",
                "    return numbers[:num_train], numbers[num_train:]\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Initialize a method to generate a train and validation samples where each sample is a dictionary with attributes `{'number': <Integer number>, 'words': <word representation of an integer number as string>}`."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "def generate_samples(numbers: List[int]) -> List[Dict[str, Any]]:\n",
                "    \"\"\"Generate samples from a list of numbers.\n",
                "\n",
                "    Args:\n",
                "        numbers (List[int]): The numbers.\n",
                "\n",
                "    Returns:\n",
                "        List[Dict[str, Any]]: The corresponding samples.\n",
                "    \"\"\"\n",
                "    samples = []\n",
                "    for num in numbers:\n",
                "        words = ' '.join(say(num))\n",
                "        sample = {'number': num, 'words': words}\n",
                "        samples.append(sample)\n",
                "    return samples\n",
                "\n",
                "\n",
                "def get_dataset(num_train: int, num_val: int) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:\n",
                "    \"\"\"Generate a number-saying dataset of the given size.\n",
                "\n",
                "    Args:\n",
                "        num_train (int): Number of training samples.\n",
                "        num_val (int): Number of validation samples.\n",
                "\n",
                "    Returns:\n",
                "        Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]: The two generated splits.\n",
                "    \"\"\"\n",
                "    train_nums, val_nums = get_numbers(num_train, num_val)\n",
                "    train_samples = generate_samples(train_nums)\n",
                "    val_samples = generate_samples(val_nums)\n",
                "    return train_samples, val_samples"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Create a non-overlapping `train` and `val` split dataset of unique random numbers."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Number of training and validation samples\n",
                "num_train_samples = 10_000 # 10k samples\n",
                "num_val_samples = 2000    # 2k samples\n",
                "\n",
                "# Create the samples.\n",
                "print(f'Generating synthetic dataset ({num_train_samples} train, {num_val_samples} val)...')\n",
                "train_samples, val_samples = get_dataset(num_train_samples, num_val_samples)\n",
                "splits = [\n",
                "    ('train', train_samples),\n",
                "    ('val', val_samples)\n",
                "]"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Let's visualize the first train and test sample"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "print(f'Train sample: {train_samples[0]}')\n",
                "print(f'Val sample: {val_samples[0]}')"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Convert the dataset to MosaicML Streaming\n",
                "\n",
                "We are going to use the `MDSWriter` to convert the raw synthetic NLP dataset into a `.mds` file format.\n",
                "\n",
                "For more information on the Streaming `MDSWriter` class check out the [API reference](https://streaming.docs.mosaicml.com/en/latest/api_reference/generated/streaming.MDSWriter.html)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Mapping of sample keyword with their data type\n",
                "columns = {\n",
                "    'number': 'int',\n",
                "    'words': 'str',\n",
                "}\n",
                "\n",
                "# Compression algorithm to use for dataset\n",
                "compression = 'zstd:12'\n",
                "\n",
                "# Hashing algorithm to use for dataset\n",
                "hashes = ['sha1', 'xxh3_64']\n",
                "\n",
                "# shard size limit, in bytes\n",
                "size_limit = 1 << 16  # Override to a small number for more shards.\n",
                "\n",
                "print(f'Saving dataset (to {out_root})...')\n",
                "for split, samples in splits:\n",
                "    print(f'* {split}')\n",
                "    dirname = os.path.join(out_root, split)\n",
                "    with MDSWriter(out=dirname, columns=columns, compression=compression, \n",
                "                   hashes=hashes, size_limit=size_limit) as out:\n",
                "        for sample in tqdm(samples, leave=False):\n",
                "            out.write(sample)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Now that we've written the datasets to `out_root`, one can upload them to a cloud storage provider, and we are ready to stream them. "
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "remote_train = upload_train_location or out_train # replace this with your URL for cloud streaming\n",
                "remote_val  = upload_val_location or out_val"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Loading the Data\n",
                "\n",
                "We extend Streaming's Dataset to deserialize the data. Let's verify the dataloading samples from `StreamingDataset` class with the raw samples for content validity and deterministic sample ordering.\n",
                "\n",
                "For more information on the `StreamingDataset` class check out the [API reference](https://streaming.docs.mosaicml.com/en/stable/api_reference/generated/streaming.StreamingDataset.html#streaming.StreamingDataset)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Load the samples back.\n",
                "print('Walking the dataset:')\n",
                "\n",
                "print(f'verifying samples for train split')\n",
                "train_dataset = StreamingDataset(remote=upload_location or out_root, local=local, split='train', shuffle=False)\n",
                "for old, new in tqdm(zip(train_samples, train_dataset), total=len(train_samples), leave=False):\n",
                "    assert old == new\n",
                "\n",
                "print(f'verifying samples for val split')\n",
                "val_dataset = StreamingDataset(remote=upload_location or out_root, local=local, split='val', shuffle=False)\n",
                "for old, new in tqdm(zip(val_samples, val_dataset), total=len(val_samples), leave=False):\n",
                "    assert old == new\n"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "We can also visualize the sample(s) by doing pythonic or NumPy indexing on a `StreamingDataset`."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Fetch the 10th sample and print it on a console\n",
                "print(f'Sample 10: {train_dataset[10]}')\n",
                "\n",
                "# Fetch multiple samples\n",
                "indices = [-1, 30, [12, -14], slice(-1, -10, -2), np.array([10, -20])]\n",
                "for indx in indices:\n",
                "    print(f'Sample {indx}: {train_dataset[indx]}')"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Below are some utility methods about the dataset which would be highly useful for debugging and model training. For more information on the `StreamingDataset` parameters, check out the [API reference](https://streaming.docs.mosaicml.com/en/stable/api_reference/generated/streaming.StreamingDataset.html#streaming.StreamingDataset)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Get the total number of samples\n",
                "print(f'Total number of samples: {train_dataset.num_samples}')\n",
                "\n",
                "# Get the number of shard files\n",
                "print(f'Total number of shards: {len(train_dataset.shards)}')\n",
                "\n",
                "# Get the number of samples inside each shard files.\n",
                "# Number of samples in each shard can vary based on each sample size.\n",
                "print(f'Number of samples inside each shards: {train_dataset.samples_per_shard}')"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "we can now wrap our streaming datasets in a standard PyTorch dataloaders for training!"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "train_dataloader = DataLoader(train_dataset, batch_size=batch_size)\n",
                "val_dataloader = DataLoader(val_dataset, batch_size=batch_size)"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The `StreamingDataset` class supports a `batch_size` parameter, if when provided by the user, worker indices will be constructed so that there \n",
                "is at most one incomplete batch at the end of each epoch for better workload distribution across workers and lesser or no drop samples. Let's look at the example below with `drop_last=True` in PyTorch DataLoader to drop the last non-divisible batch for demonstration purpose:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Instantiate a streaming Dataset class without a `batch_size` parameter \n",
                "dataset_without_bs = StreamingDataset(remote=remote_train, local=os.path.join(local_train, 'wo_bs'), shuffle=shuffle_train)\n",
                "dataloader_ds_wo_bs = DataLoader(dataset_without_bs, batch_size=batch_size, num_workers=8, drop_last=True)\n",
                "\n",
                "# Instantiate a streaming Dataset class with a `batch_size` parameter \n",
                "dataset_with_bs = StreamingDataset(remote=remote_train, local=os.path.join(local_train, 'w_bs'), shuffle=shuffle_train, batch_size=batch_size)\n",
                "dataloader_ds_with_bs = DataLoader(dataset_with_bs, batch_size=batch_size, num_workers=8, drop_last=True)"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Visualize the number of samples processed by the dataloader when `batch_size` was not provided during instantiation of `StreamingDataset` class. It would be 8192 out of total samples of 10,000."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "total_samples = 0\n",
                "for idx, batch in enumerate(dataloader_ds_wo_bs):\n",
                "    total_samples += len(batch[\"number\"])\n",
                "print(f'Total number of samples processed by the dataloader is {total_samples} out of {num_train_samples}')\n"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Visualize the number of samples processed by the dataloader when `batch_size` was provided during instantiation of `StreamingDataset` class. We will see that the number of samples processed is higher than when `batch_size` is absent. It would be 9728 out of total samples of 10,000."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "total_samples = 0\n",
                "for idx, batch in enumerate(dataloader_ds_with_bs):\n",
                "    total_samples += len(batch[\"number\"])\n",
                "print(f'Total number of samples processed by the dataloader is {total_samples} out of {num_train_samples}')\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Cleanup\n",
                "\n",
                "That's it. No need to hang on to the files created by the tutorial..."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "shutil.rmtree(out_root, ignore_errors=True)\n",
                "shutil.rmtree(local, ignore_errors=True)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "\n",
                "## What next?\n",
                "\n",
                "You've now seen an in-depth look at how to prepare and use streaming datasets with PyTorch.\n",
                "\n",
                "To continue learning about Streaming, please continue to explore our examples!"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Come get involved with MosaicML!\n",
                "\n",
                "We'd love for you to get involved with the MosaicML community in any of these ways:\n",
                "\n",
                "### [Star Streaming on GitHub](https://github.com/mosaicml/streaming)\n",
                "\n",
                "Help make others aware of our work by [starring Streaming on GitHub](https://github.com/mosaicml/streaming).\n",
                "\n",
                "### [Join the MosaicML Slack](https://dub.sh/mcomm)\n",
                "\n",
                "Head on over to the [MosaicML slack](https://dub.sh/mcomm) to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!\n",
                "\n",
                "### Contribute to Streaming\n",
                "\n",
                "Is there a bug you noticed or a feature you'd like? File an [issue](https://github.com/mosaicml/streaming/issues) or make a [pull request](https://github.com/mosaicml/streaming/pulls)!"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3.10.6 ('streaming_py3_10')",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.8.16"
        },
        "orig_nbformat": 4,
        "vscode": {
            "interpreter": {
                "hash": "cb0371d9985d03b7be04a8e8a123b72f0ef8951070c9235d824cee9281d7d420"
            }
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}
