{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ⏯️ Autoresume Training\n",
    "\n",
    "We all know the pain of having a training run die in the middle of training. Composer's autoresume feature provides a convenient solution to picking back up automatically from the last checkpoint when re-running a previously failed training script.\n",
    "\n",
    "We've put together this tutorial to demonstrate this feature in action and how you can activate it through the Composer trainer.\n",
    "\n",
    "**🐕 Autoresume via Watchdog**: Composer autoresumption works best when coupled with automated node failure detection and retries on Mosaic AI training. \n",
    "See our [platform docs page](https://docs.mosaicml.com/projects/mcli/en/latest/training/watchdog.html) on enabling this feature for your runs\n",
    "\n",
    "### Recommended Background\n",
    "\n",
    "This tutorial assumes you are familiar with the Composer trainer basics and its [saving/checkpointing features][checkpointing]. \n",
    "\n",
    "### Tutorial Goals and Concepts Covered\n",
    "\n",
    "The goal of this tutorial is to demonstrate autoresume by simulating a failed and re-run training script with the Composer trainer. More details on that below.\n",
    "\n",
    "For a deeper look into the way autoresume works, check out our more in depth [notes][autoresume].\n",
    "\n",
    "[checkpointing]: https://docs.mosaicml.com/projects/composer/en/stable/trainer/checkpointing.html\n",
    "[autoresume]: https://docs.mosaicml.com/projects/composer/en/stable/notes/resumption.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install Dependencies\n",
    "\n",
    "Install Composer, if it isn't already installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install mosaicml \n",
    "# To install from source instead of the last release, comment the command above and uncomment the following one.\n",
    "# %pip install git+https://github.com/mosaicml/composer.git"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training Script\n",
    "\n",
    "Let's use the block of code below as our training script. Autoresume comes in handy when training gets interrupted for some reason (say, your node dies). In those cases, being able to seamlessly pick up from your last checkpoint can be a huge time saver. We can simulate such a situation by manually interrupting our \"training script\" (the below cell) and re-running with autoresume!\n",
    "\n",
    "**To see this example in action, run this notebook twice.**\n",
    "\n",
    "* The first time the notebook is run, the trainer will save a checkpoint to the `save_folder` and train\n",
    "  for one epoch.\n",
    "* Any subsequent time the notebook is run, the trainer will resume from the latest checkpoint if using `autoresume=True`. If\n",
    "  the latest checkpoint was saved at ``max_duration``, meaning all training is finished, the Trainer will\n",
    "  exit immediately with an error that no training would occur.\n",
    "\n",
    "When the Trainer is configured with `autoresume=True`, it will automatically look for existing checkpoints and resume training. If no checkpoints exist, it'll start a new training run. This allows you to automatically resume from any faults, with no code changes.\n",
    "\n",
    "To simulate a flaky spot instance, try interrupting the notebook (e.g. Ctrl-C) midway through the\n",
    "first training run (say, after epoch 0 is finished). Notice how the progress bars resume at the next\n",
    "epoch and not repeat any training already completed.\n",
    "\n",
    "This feature does not require code or configuration changes to distinguish between starting a new training\n",
    "run or automatically resuming from an existing one, making it easy to use Composer on preemptable cloud instances.\n",
    "Simply configure the instance to start Composer with the same command every time until training has finished!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "class MNISTModel(nn.Module):\n",
    "    \"\"\"Toy convolutional neural network architecture in pytorch for MNIST.\"\"\"\n",
    "\n",
    "    def __init__(self, num_classes: int = 10):\n",
    "        super().__init__()\n",
    "        self.num_classes = num_classes\n",
    "        self.conv1 = nn.Conv2d(1, 16, (3, 3), padding=0)\n",
    "        self.conv2 = nn.Conv2d(16, 32, (3, 3), padding=0)\n",
    "        self.bn = nn.BatchNorm2d(32)\n",
    "        self.fc1 = nn.Linear(32 * 16, 32)\n",
    "        self.fc2 = nn.Linear(32, num_classes)\n",
    "\n",
    "    def forward(self, x):\n",
    "        out = self.conv1(x)\n",
    "        out = F.relu(out)\n",
    "        out = self.conv2(out)\n",
    "        out = self.bn(out)\n",
    "        out = F.relu(out)\n",
    "        out = F.adaptive_avg_pool2d(out, (4, 4))\n",
    "        out = torch.flatten(out, 1, -1)\n",
    "        out = self.fc1(out)\n",
    "        out = F.relu(out)\n",
    "        return self.fc2(out)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch.utils.data\n",
    "from torch.optim import SGD\n",
    "from torchvision.datasets import MNIST\n",
    "from torchvision.transforms import ToTensor\n",
    "\n",
    "from composer import Trainer\n",
    "from composer.models import ComposerClassifier\n",
    "\n",
    "# Configure the trainer -- here, we train a simple MNIST classifier\n",
    "model = ComposerClassifier(module=MNISTModel(num_classes=10), num_classes=10)\n",
    "optimizer = SGD(model.parameters(), lr=0.01)\n",
    "train_dataloader = torch.utils.data.DataLoader(\n",
    "    dataset=MNIST('~/datasets', train=True, download=True, transform=ToTensor()),\n",
    "    batch_size=2048,\n",
    ")\n",
    "eval_dataloader = torch.utils.data.DataLoader(\n",
    "    dataset=MNIST('~/datasets', train=True, download=True, transform=ToTensor()),\n",
    "    batch_size=2048,\n",
    ")\n",
    "\n",
    "# When using `autoresume`, it is required to specify the `run_name`, so\n",
    "# Composer will know which training run to resume\n",
    "run_name = 'my_autoresume_training_run'\n",
    "\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    max_duration='5ep',\n",
    "    optimizers=optimizer,\n",
    "\n",
    "    # Training data configuration\n",
    "    train_dataloader=train_dataloader,\n",
    "    train_subset_num_batches=5,  # For this example, limit each epoch to 5 batches\n",
    "\n",
    "    # Evaluation configuration\n",
    "    eval_dataloader=eval_dataloader,\n",
    "    eval_subset_num_batches=5,  # For this example, limit evaluation to 5 batches\n",
    "\n",
    "    # Checkpoint configuration\n",
    "    run_name=run_name,\n",
    "    save_folder='./my_autoresume_training_run', # Make sure to specify `save_folder` to enable saving\n",
    "    save_interval='1ep',\n",
    "\n",
    "    # Configure autoresume!\n",
    "    autoresume=True,\n",
    ")\n",
    "\n",
    "print('Training!')\n",
    "\n",
    "# Train!\n",
    "trainer.fit()\n",
    "\n",
    "# Print the number of trained epochs (should always be the `max_duration`, which is 5ep)\n",
    "print(f'\\nNumber of epochs trained: {trainer.state.timestamp.epoch}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## What next?\n",
    "\n",
    "You've now seen how autoresume can save you from the headache of unpredictable training interruptions.\n",
    "\n",
    "For a deeper dive, check out our more in depth [notes on this feature][autoresume].\n",
    "\n",
    "In addition, please continue to explore our tutorials! Here's a couple suggestions:\n",
    "\n",
    "* Continue learning about other Composer features like [automatic gradient accumulation][autograd].\n",
    "\n",
    "* Explore more advanced applications of Composer like [applying image segmentation to medical images][image_segmentation_tutorial] or [fine-tuning a transformer for sentiment classification][huggingface_tutorial].\n",
    "\n",
    "* Learn how to [train without local storage][no_local_storage_tutorial].\n",
    "\n",
    "[autograd]: https://docs.mosaicml.com/projects/composer/en/stable/examples/auto_microbatching.html\n",
    "[autoresume]: https://docs.mosaicml.com/projects/composer/en/stable/notes/resumption.html\n",
    "[image_segmentation_tutorial]: https://docs.mosaicml.com/projects/composer/en/stable/examples/medical_image_segmentation.html\n",
    "[huggingface_tutorial]: https://docs.mosaicml.com/projects/composer/en/stable/examples/huggingface_models.html\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Come get involved with MosaicML!\n",
    "\n",
    "We'd love for you to get involved with the MosaicML community in any of these ways:\n",
    "\n",
    "### [Star Composer on GitHub](https://github.com/mosaicml/composer)\n",
    "\n",
    "Help make others aware of our work by [starring Composer on GitHub](https://github.com/mosaicml/composer).\n",
    "\n",
    "### [Join the MosaicML Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg)\n",
    "\n",
    "Head on over to the [MosaicML slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg) to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!\n",
    "\n",
    "### Contribute to Composer\n",
    "\n",
    "Is there a bug you noticed or a feature you'd like? File an [issue](https://github.com/mosaicml/composer/issues) or make a [pull request](https://github.com/mosaicml/composer/pulls)!"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
