{
    "cells": [
        {
            "attachments": {},
            "cell_type": "markdown",
            "id": "f6e1c768",
            "metadata": {},
            "source": [
                "# FaceSynthetics\n",
                "\n",
                "Why wait for your data to download when you can stream it instead? Let's see how to do so with MosaicML [Streaming](https://github.com/mosaicml/streaming) and [Composer](https://github.com/mosaicml/composer).\n",
                "\n",
                "Streaming is useful for multi-node setups where workers don't have persistent storage and each element of the dataset must be downloaded exactly once. Composer is a library for training neural networks better, faster, and cheaper. In this tutorial, we'll demonstrate a streaming approach to loading our datasets, using Microsoft's FaceSynthetics dataset as an example, and we'll use composer for model training.\n",
                "\n",
                "### Recommended Background\n",
                "\n",
                "This tutorial assumes that you're reasonably familiar with the workings of datasets and dataloaders for training deep learning models. In addition, since we'll be building from a computer vision example, familiarity in that area will likely be useful as well.\n",
                "\n",
                "If you're already familiar with streaming's dataset classes ([StreamingDataset][streaming_dataset] and [MDSWriter][streaming_dataset_mds_writer]), that's great. If not, you may want to pause while working through the tutorial and look at the docs referenced along the way.\n",
                "\n",
                "### Tutorial Goals and Concepts Covered\n",
                "\n",
                "The goal of this tutorial is to showcase how to prepare the dataset and use Streaming data loading to train the model. It will consist of a few steps:\n",
                "\n",
                "1. Obtaining the dataset\n",
                "2. Preparing the dataset for streaming\n",
                "3. Streaming the dataset to the local machine\n",
                "4. Training a model using these datasets\n",
                "\n",
                "Let's get started!\n",
                "\n",
                "[streaming_dataset]: https://streaming.docs.mosaicml.com/en/stable/api_reference/generated/streaming.StreamingDataset.html#streaming.StreamingDataset\n",
                "[streaming_dataset_mds_writer]: https://streaming.docs.mosaicml.com/en/latest/api_reference/generated/streaming.MDSWriter.html"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "c1946d79",
            "metadata": {},
            "source": [
                "## Setup\n",
                "\n",
                "Let's start by making sure the right packages are installed and imported.\n",
                "\n",
                "First, let's make sure we've installed our dependencies; note that `mmcv-full` will take some time to unpack. To speed things up, we have included `mmcv`, `mmsegmentation` and many other useful computer vision libraries in the `mosaicml/pytorch_vision` [Docker Image][docker_image]. We need [Composer](https://github.com/mosaicml/composer) for model training and [Streaming](https://github.com/mosaicml/streaming) for streaming the dataset.\n",
                "\n",
                "[docker_image]: https://github.com/mosaicml/composer/tree/dev/docker#docker"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "79f46743",
            "metadata": {},
            "outputs": [],
            "source": [
                "%pip install mmsegmentation mmcv mmcv-full\n",
                "\n",
                "%pip install mosaicml\n",
                "# To install from source instead of the last release, comment the command above and uncomment the following one.\n",
                "# %pip install git+https://github.com/mosaicml/composer.git\n",
                "\n",
                "%pip install mosaicml-streaming\n",
                "# To install from source instead of the last release, comment the command above and uncomment the following one.\n",
                "# %pip install git+https://github.com/mosaicml/streaming.git\n",
                "\n",
                "# (Optional) To upload a streaming dataset to an AWS S3 bucket\n",
                "%pip install awscli"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "13590fcf",
            "metadata": {},
            "outputs": [],
            "source": [
                "import os\n",
                "import time\n",
                "import torch\n",
                "import struct\n",
                "import shutil\n",
                "import requests\n",
                "\n",
                "from PIL import Image\n",
                "from io import BytesIO\n",
                "from zipfile import ZipFile\n",
                "from torch.utils.data import DataLoader\n",
                "from typing import Iterator, Tuple, Dict\n",
                "from torchvision import transforms as tf"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "id": "08e68890",
            "metadata": {},
            "source": [
                "We'll be using Streaming's dataset writer called `MDSWriter` which writes the dataset in Streaming format, Composer `DeepLabV3` model which should help improve our performance even on the small hundred-image dataset, and `StreamingDataset` to load the streaming dataset."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "5a4ddb22",
            "metadata": {},
            "outputs": [],
            "source": [
                "from streaming import MDSWriter, StreamingDataset\n",
                "from composer.models.deeplabv3 import composer_deeplabv3"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "ae588655",
            "metadata": {},
            "outputs": [],
            "source": [
                "from composer import Trainer\n",
                "from composer.models import composer_deeplabv3\n",
                "from composer.optim import DecoupledAdamW"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "3d7069fd",
            "metadata": {},
            "source": [
                "## Global settings\n",
                "\n",
                "For this tutorial, it makes the most sense to organize our global settings here rather than distribute them throughout the cells in which they're used."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "a1f45ee0",
            "metadata": {},
            "outputs": [],
            "source": [
                "# the location of our dataset\n",
                "in_root = \"./dataset\"\n",
                "\n",
                "# the location of the \"remote\" streaming dataset (`sds`). \n",
                "# Upload `out_root` to your cloud storage provider of choice.\n",
                "out_root = \"./sds\"\n",
                "out_train = \"./sds/train\"\n",
                "out_test = \"./sds/test\"\n",
                "\n",
                "# the location to download the streaming dataset during training\n",
                "local = './local'\n",
                "local_train = './local/train'\n",
                "local_test = './local/test'\n",
                "\n",
                "# toggle shuffling in dataloader\n",
                "shuffle_train = True\n",
                "shuffle_test = False\n",
                "\n",
                "# possible values for a pixel in the annotation image to take\n",
                "num_classes = 20\n",
                "\n",
                "# shard size limit, in bytes\n",
                "size_limit = 1 << 25\n",
                "\n",
                "# show a progress bar while downloading\n",
                "use_tqdm = True\n",
                "\n",
                "# ratio of training data to test data\n",
                "training_ratio = 0.9\n",
                "\n",
                "# training batch size\n",
                "batch_size = 2 # this is the smallest batch size possible, \n",
                "               # increase this if your machine can handle it.\n",
                "\n",
                "# training hardware parameters\n",
                "device = \"gpu\" if torch.cuda.is_available() else \"cpu\"\n",
                "\n",
                "# number of training epochs\n",
                "train_epochs = \"3ep\" # increase the number of epochs for greater accuracy\n",
                "\n",
                "# number of images in the dataset (training + test)\n",
                "num_images = 100 # can be 100, 1_000, or 100_000\n",
                "\n",
                "# location to download the dataset zip file\n",
                "dataset_archive = \"./dataset.zip\"\n",
                "\n",
                "# remote dataset URL\n",
                "URL = f\"https://facesyntheticspubwedata.blob.core.windows.net/iccv-2021/dataset_{num_images}.zip\"\n",
                "\n",
                "# Hashing algorithm to use for dataset\n",
                "hashes = ['sha1' ,'xxh64']"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "aa469aec",
            "metadata": {},
            "outputs": [],
            "source": [
                "# upload location for the dataset splits (change this if you want to upload to a different location, for example, AWS S3 bucket location)\n",
                "upload_location = None\n",
                "\n",
                "if upload_location is None:\n",
                "    upload_train_location = None\n",
                "    upload_test_location = None\n",
                "else:\n",
                "    upload_train_location = os.path.join(upload_location, 'train')\n",
                "    upload_test_location = os.path.join(upload_location, 'test')"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "f1d481ed",
            "metadata": {},
            "source": [
                "## Getting the dataset"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "da386c3d",
            "metadata": {},
            "outputs": [],
            "source": [
                "if not os.path.exists(dataset_archive):\n",
                "    response = requests.get(URL)\n",
                "    with open(dataset_archive, \"wb\") as dataset_file:\n",
                "        dataset_file.write(response.content)\n",
                "        \n",
                "    with ZipFile(dataset_archive, 'r') as myzip:\n",
                "        myzip.extractall(in_root)"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "0425d05f",
            "metadata": {},
            "source": [
                "Next, we'll make the directories for our binary streaming dataset files."
            ]
        },
        {
            "cell_type": "markdown",
            "id": "2f083524",
            "metadata": {},
            "source": [
                "## Preparing the dataset\n",
                "\n",
                "The dataset consists of a directory of images with names in the form `123456.png`, `123456_seg.png`, and `123456_ldmks.png`. For this example, we'll only use the images with segmentation annotations as labels and ignore the landmarks for now. "
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "f2ff92f1",
            "metadata": {},
            "outputs": [],
            "source": [
                "def each(dirname: str, start_ix: int = 0, end_ix: int = num_images) -> Iterator[Dict[str, bytes]]:\n",
                "    for i in range(start_ix, end_ix):\n",
                "        image = '%s/%06d.png' % (dirname, i)\n",
                "        annotation = '%s/%06d_seg.png' % (dirname, i)\n",
                "\n",
                "        with open(image, 'rb') as x, open(annotation, 'rb') as y:\n",
                "            yield {\n",
                "                'x': x.read(),\n",
                "                'y': y.read(),\n",
                "            }"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "767a6f1a",
            "metadata": {},
            "source": [
                "Below, we'll set up the logic for writing our starting dataset to files that can be read using a streaming dataloader.\n",
                "\n",
                "For more information on the `MDSWriter` check out the [API reference][api].\n",
                "\n",
                "[api]: https://streaming.docs.mosaicml.com/en/latest/api_reference/generated/streaming.MDSWriter.html"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "id": "38e59c9e",
            "metadata": {},
            "outputs": [],
            "source": [
                "def write_datasets() -> None:\n",
                "    fields = {'x': 'png', 'y': 'png'}\n",
                "    \n",
                "    num_training_images = int(num_images * training_ratio)\n",
                "    \n",
                "    start_ix, end_ix = 0, num_training_images\n",
                "    with MDSWriter(out=out_train, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
                "        for sample in each(in_root, start_ix, end_ix):\n",
                "            out.write(sample)\n",
                "    start_ix, end_ix = end_ix, num_images  \n",
                "    with MDSWriter(out=out_test, columns=fields, hashes=hashes, size_limit=size_limit) as out:\n",
                "        for sample in each(in_root, start_ix, end_ix):\n",
                "            out.write(sample) "
            ]
        },
        {
            "cell_type": "markdown",
            "id": "810dc5d9",
            "metadata": {},
            "source": [
                "Now that we've written the datasets to `out_root`, one can upload them to a cloud storage provider, and we are ready to stream them. "
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "63bfa9d7",
            "metadata": {},
            "outputs": [],
            "source": [
                "remote_train = upload_train_location or out_train # replace this with your URL for cloud streaming\n",
                "remote_test  = upload_test_location or out_test"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "id": "64dc2143",
            "metadata": {},
            "source": [
                "## Loading the Data\n",
                "\n",
                "We extend `StreamingDataset` to deserialize the binary data and convert the labels to one-hot encoding.\n",
                "\n",
                "For more information on the `StreamingDataset` class check out the [API reference](https://streaming.docs.mosaicml.com/en/stable/api_reference/generated/streaming.StreamingDataset.html#streaming.StreamingDataset)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "5b808738",
            "metadata": {},
            "outputs": [],
            "source": [
                "class FaceSynthetics(StreamingDataset):\n",
                "    def __init__(self,\n",
                "                 remote: str,\n",
                "                 local: str,\n",
                "                 shuffle: bool,\n",
                "                 batch_size: int,\n",
                "                ) -> None:\n",
                "        super().__init__(local=local, remote=remote, shuffle=shuffle, batch_size=batch_size)\n",
                "\n",
                "    def __getitem__(self, i:int) -> Tuple[torch.Tensor, torch.Tensor]:\n",
                "        obj = super().__getitem__(i)\n",
                "        x = tf.functional.to_tensor(obj['x'])\n",
                "        y = tf.functional.pil_to_tensor(obj['y'])[0].to(torch.int64)\n",
                "        y[y == 255] = 19\n",
                "        return x, y"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "48292a0d",
            "metadata": {},
            "source": [
                "## Putting It All Together"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "bcc06955",
            "metadata": {},
            "source": [
                "We're now ready to actually write the streamable dataset. Let's do that if we haven't already."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "77470859",
            "metadata": {},
            "outputs": [],
            "source": [
                "if not os.path.exists(out_train):\n",
                "    write_datasets()"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "af441824",
            "metadata": {},
            "source": [
                "(Optional) Upload the Streaming dataset to an AWS S3 bucket of your choice. Uncomment the below line if you have provided the S3 bucket link to `upload_location`."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "3c0e0625",
            "metadata": {
                "vscode": {
                    "languageId": "shellscript"
                }
            },
            "outputs": [],
            "source": [
                "# !aws s3 cp $out_root $upload_location --recursive"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "74dcae51",
            "metadata": {},
            "source": [
                "Once that's done, we can instantiate our streaming datasets and wrap them in standard dataloaders for training!"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "4bfefaad",
            "metadata": {},
            "outputs": [],
            "source": [
                "dataset_train = FaceSynthetics(remote_train, local_train, shuffle_train, batch_size=batch_size)\n",
                "dataset_test  = FaceSynthetics(remote_test, local_test, shuffle_test, batch_size=batch_size)\n",
                "\n",
                "train_dataloader = DataLoader(dataset_train, batch_size=batch_size)\n",
                "test_dataloader = DataLoader(dataset_test, batch_size=batch_size)"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "f5be0522",
            "metadata": {},
            "source": [
                "### Train with the Streaming Dataloaders\n",
                "\n",
                "Now all that's left to do is train! Doing so with Composer should look pretty familiar by now."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "f082a5e8",
            "metadata": {},
            "outputs": [],
            "source": [
                "# Create a DeepLabV3 model, and an optimizer for it\n",
                "model = composer_deeplabv3(\n",
                "    num_classes=num_classes, \n",
                "    backbone_arch='resnet101', \n",
                "    backbone_weights='IMAGENET1K_V2',\n",
                "    sync_bn=False)\n",
                "optimizer = DecoupledAdamW(model.parameters(), lr=1e-3)\n",
                "\n",
                "# Create a trainer object without our model, optimizer, and streaming dataloaders\n",
                "trainer = Trainer(\n",
                "    model=model,\n",
                "    train_dataloader=train_dataloader,\n",
                "    eval_dataloader=test_dataloader,\n",
                "    max_duration=train_epochs,\n",
                "    optimizers=optimizer,\n",
                "    device=device\n",
                ")\n",
                "\n",
                "# Train!\n",
                "start_time = time.perf_counter()\n",
                "trainer.fit()\n",
                "end_time = time.perf_counter()\n",
                "print(f\"It took {end_time - start_time:0.4f} seconds to train\")"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "8f219ed4",
            "metadata": {},
            "source": [
                "## Cleanup\n",
                "\n",
                "That's it. No need to hang on to the files created by the tutorial..."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "66663796",
            "metadata": {},
            "outputs": [],
            "source": [
                "shutil.rmtree(out_root, ignore_errors=True)\n",
                "shutil.rmtree(in_root, ignore_errors=True)\n",
                "shutil.rmtree(local, ignore_errors=True)\n",
                "if os.path.exists(dataset_archive):\n",
                "    os.remove(dataset_archive)"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "b51f8e09",
            "metadata": {},
            "source": [
                "\n",
                "## What next?\n",
                "\n",
                "You've now seen an in-depth look at how to prepare and use streaming datasets with Composer.\n",
                "\n",
                "To continue learning about Streaming, please continue to explore our examples!"
            ]
        },
        {
            "attachments": {},
            "cell_type": "markdown",
            "id": "ca0bca40",
            "metadata": {},
            "source": [
                "## Come get involved with MosaicML!\n",
                "\n",
                "We'd love for you to get involved with the MosaicML community in any of these ways:\n",
                "\n",
                "### [Star Streaming on GitHub](https://github.com/mosaicml/streaming)\n",
                "\n",
                "Help make others aware of our work by [starring Streaming on GitHub](https://github.com/mosaicml/streaming).\n",
                "\n",
                "### [Join the MosaicML Slack](https://dub.sh/mcomm)\n",
                "\n",
                "Head on over to the [MosaicML slack](https://dub.sh/mcomm) to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!\n",
                "\n",
                "### Contribute to Streaming\n",
                "\n",
                "Is there a bug you noticed or a feature you'd like? File an [issue](https://github.com/mosaicml/streaming/issues) or make a [pull request](https://github.com/mosaicml/streaming/pulls)!"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3.10.6 ('streaming_py3_10')",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.10.10"
        },
        "vscode": {
            "interpreter": {
                "hash": "cb0371d9985d03b7be04a8e8a123b72f0ef8951070c9235d824cee9281d7d420"
            }
        }
    },
    "nbformat": 4,
    "nbformat_minor": 5
}
