{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🤗 Finetuning Hugging Face Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Want to use Hugging Face models with Composer? No problem. Here, we'll walk through using Composer to fine-tune a pretrained Hugging Face BERT model.\n",
    "\n",
    "### Recommended Background\n",
    "\n",
    "This tutorial assumes you are familiar with transformer models for NLP and with Hugging Face.\n",
    "\n",
    "To better understand the Composer part, make sure you're comfortable with the material in our [Getting Started][getting_started] tutorial.\n",
    "\n",
    "### Tutorial Goals and Concepts Covered\n",
    "\n",
    "The goal of this tutorial is to demonstrate how to fine-tune a pretrained Hugging Face transformer using the Composer library!\n",
    "\n",
    "We will focus on fine-tuning a pretrained BERT-base model on the Stanford Sentiment Treebank v2 (SST-2) dataset. After fine-tuning, the BERT model should be able to determine if a sentence has positive or negative sentiment.\n",
    "\n",
    "Along the way, we will touch on:\n",
    "\n",
    "* Creating our Hugging Face BERT model, tokenizer, and data loaders\n",
    "* Wrapping the Hugging Face model as a `ComposerModel` for use with the Composer trainer\n",
    "* Training with Composer\n",
    "* Visualization examples\n",
    "\n",
    "Let's do this 🚀\n",
    "\n",
    "[getting_started]: https://docs.mosaicml.com/projects/composer/en/stable/examples/getting_started.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install Composer\n",
    "\n",
    "To use Hugging Face with Composer, we'll need to install Composer *with the NLP dependencies*. If you haven't already, run: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install 'mosaicml[nlp, tensorboard]'\n",
    "# To install from source instead of the last release, comment the command above and uncomment the following one.\n",
    "# %pip install 'mosaicml[nlp, tensorboard] @ git+https://github.com/mosaicml/composer.git'\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Hugging Face Pretrained Model\n",
    "First, we import a pretrained BERT model (specifically, BERT-base for uncased text) and its associated tokenizer from the transformers library.\n",
    "\n",
    "Sentiment classification has two labels, so we set `num_labels=2` when creating our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import transformers\n",
    "\n",
    "# Create a BERT sequence classification model using Hugging Face transformers\n",
    "model = transformers.AutoModelForSequenceClassification.from_pretrained('google-bert/bert-base-uncased', num_labels=2)\n",
    "tokenizer = transformers.AutoTokenizer.from_pretrained('google-bert/bert-base-uncased') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Dataloaders"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will download and tokenize the SST-2 datasets. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datasets\n",
    "import os\n",
    "from multiprocessing import cpu_count\n",
    "\n",
    "# Create BERT tokenizer\n",
    "def tokenize_function(sample):\n",
    "    return tokenizer(\n",
    "        text=sample['sentence'],\n",
    "        padding=\"max_length\",\n",
    "        max_length=256,\n",
    "        truncation=True\n",
    "    )\n",
    "\n",
    "# Tokenize SST-2\n",
    "sst2_dataset = datasets.load_dataset(\"glue\", \"sst2\", num_proc=os.cpu_count() - 1)\n",
    "tokenized_sst2_dataset = sst2_dataset.map(tokenize_function,\n",
    "                                          batched=True, \n",
    "                                          num_proc=cpu_count(),\n",
    "                                          batch_size=100,\n",
    "                                          remove_columns=['idx', 'sentence'])\n",
    "\n",
    "# Split dataset into train and validation sets\n",
    "train_dataset = tokenized_sst2_dataset[\"train\"]\n",
    "eval_dataset = tokenized_sst2_dataset[\"validation\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we will create a PyTorch `DataLoader` for each of the datasets generated in the previous block."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "data_collator = transformers.data.data_collator.default_data_collator\n",
    "train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=False, drop_last=False, collate_fn=data_collator)\n",
    "eval_dataloader = DataLoader(eval_dataset,batch_size=16, shuffle=False, drop_last=False, collate_fn=data_collator)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert model to `ComposerModel`\n",
    "\n",
    "Composer uses `HuggingFaceModel` as a convenient interface for wrapping a Hugging Face model (such as the one we created above) in a `ComposerModel`. Its parameters are:\n",
    "\n",
    "- `model`: The Hugging Face model to wrap.\n",
    "- `tokenizer`: The Hugging Face tokenizer used to create the input data\n",
    "- `metrics`: A list of torchmetrics to apply to the output of `eval_forward` (a `ComposerModel` method).\n",
    "- `use_logits`: A boolean which, if True, flags that the model's output logits should be used to calculate validation metrics.\n",
    "\n",
    "See the [API Reference][api] for additional details.\n",
    "\n",
    "[api]: https://docs.mosaicml.com/projects/composer/en/stable/api_reference/generated/composer.models.HuggingFaceModel.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchmetrics.classification import MulticlassAccuracy\n",
    "from composer.models.huggingface import HuggingFaceModel\n",
    "from composer.metrics import CrossEntropy\n",
    "\n",
    "metrics = [CrossEntropy(), MulticlassAccuracy(num_classes=2, average='micro')]\n",
    "# Package as a trainer-friendly Composer model\n",
    "composer_model = HuggingFaceModel(model, tokenizer=tokenizer, metrics=metrics, use_logits=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimizers and Learning Rate Schedulers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last setup step is to create an optimizer and a learning rate scheduler. We will use PyTorch's AdamW optimizer and linear learning rate scheduler since these are typically used to fine-tune BERT on tasks such as SST-2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.optim import AdamW\n",
    "from torch.optim.lr_scheduler import LinearLR\n",
    "\n",
    "optimizer = AdamW(\n",
    "    params=composer_model.parameters(),\n",
    "    lr=3e-5, betas=(0.9, 0.98),\n",
    "    eps=1e-6, weight_decay=3e-6\n",
    ")\n",
    "linear_lr_decay = LinearLR(\n",
    "    optimizer, start_factor=1.0,\n",
    "    end_factor=0, total_iters=150\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Composer Trainer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now specify a Composer `Trainer` object and run our training! `Trainer` has many arguments that are described in our [documentation](https://docs.mosaicml.com/projects/composer/en/stable/api_reference/generated/composer.Trainer.html#trainer), so we'll discuss only the less-obvious arguments used below:\n",
    "\n",
    "- `max_duration` - a string specifying how long to train. This can be in terms of batches (e.g., `'10ba'` is 10 batches) or epochs (e.g., `'1ep'` is 1 epoch), [among other options][time].\n",
    "- `schedulers` - a (list of) PyTorch or Composer learning rate scheduler(s) that will be composed together.\n",
    "- `device` - specifies if the training will be done on CPU or GPU by using `'cpu'` or `'gpu'`, respectively. You can omit this to automatically train on GPUs if they're available and fall back to the CPU if not.\n",
    "- `train_subset_num_batches` - specifies the number of training batches to use for each epoch. This is not a necessary argument but is useful for quickly testing code.\n",
    "- `precision` - whether to do the training in full precision (`'fp32'`) or mixed precision (`'amp'`). Mixed precision can provide a ~2x training speedup on recent NVIDIA GPUs.\n",
    "- `seed` - sets the random seed for the training run, so the results are reproducible!\n",
    "\n",
    "[time]: https://docs.mosaicml.com/projects/composer/en/stable/trainer/time.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from composer import Trainer\n",
    "\n",
    "# Create Trainer Object\n",
    "trainer = Trainer(\n",
    "    model=composer_model, # This is the model from the HuggingFaceModel wrapper class.\n",
    "    train_dataloader=train_dataloader,\n",
    "    eval_dataloader=eval_dataloader,\n",
    "    max_duration=\"1ep\",\n",
    "    optimizers=optimizer,\n",
    "    schedulers=[linear_lr_decay],\n",
    "    device='gpu' if torch.cuda.is_available() else 'cpu',\n",
    "    train_subset_num_batches=150,\n",
    "    precision='fp32',\n",
    "    seed=17\n",
    ")\n",
    "# Start training\n",
    "trainer.fit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing Results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the training's validation accuracy, we read the `Trainer` object `state.eval_metrics` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.state.eval_metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our model reaches ~86% accuracy with only 150 iterations of training! \n",
    "Let's visualize a few samples from the validation set to see how our model performs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_batch = next(iter(eval_dataloader))\n",
    "\n",
    "# Move batch to gpu\n",
    "eval_batch = {k: v.cuda() if torch.cuda.is_available() else v for k, v in eval_batch.items()}\n",
    "with torch.no_grad():\n",
    "    predictions = composer_model(eval_batch)[\"logits\"].argmax(dim=1)\n",
    "\n",
    "# Visualize only 5 samples\n",
    "predictions = predictions[:5]\n",
    "\n",
    "label = ['negative', 'positive']\n",
    "for i, prediction in enumerate(predictions):\n",
    "    sentence = sst2_dataset[\"validation\"][i][\"sentence\"]\n",
    "    correct_label = label[sst2_dataset[\"validation\"][i][\"label\"]]\n",
    "    prediction_label = label[prediction]\n",
    "    print(f\"Sample: {sentence}\")\n",
    "    print(f\"Label: {correct_label}\")\n",
    "    print(f\"Prediction: {prediction_label}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save Fine-Tuned Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, to save the fine-tuned model parameters we call the PyTorch `save` method and pass it the model's `state_dict`: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.save(trainer.state.model.state_dict(), 'model.pt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What next?\n",
    "\n",
    "You've now seen how to use the Composer `Trainer` to fine-tune a pre-trained Hugging Face BERT on a subset of the SST-2 dataset.\n",
    "\n",
    "If you want to keep learning more, try looking through some of the documents linked throughout this tutorial to see if you can form a deeper intuition for what's going on in these examples.\n",
    "\n",
    "In addition, please continue to explore our tutorials and examples! Here are a couple suggestions:\n",
    "\n",
    "* Explore domain-specific pretraining of a Hugging Face model in a second Hugging Face + Composer [tutorial][tutorial].\n",
    "\n",
    "* Explore more advanced applications of Composer like [applying image segmentation to medical images][image_segmentation_tutorial].\n",
    "\n",
    "* Learn about callbacks and how to apply [early stopping][early_stopping_tutorial].\n",
    "\n",
    "* Check out the [examples][examples] repo for full examples of training large language models like GPT and BERT, image segmentation models like DeepLab, and more!\n",
    "\n",
    "[tutorial]: https://docs.mosaicml.com/projects/composer/en/stable/examples/pretrain_finetune_huggingface.html\n",
    "[examples]: https://github.com/mosaicml/examples\n",
    "[image_segmentation_tutorial]: https://docs.mosaicml.com/projects/composer/en/stable/examples/medical_image_segmentation.html\n",
    "[early_stopping_tutorial]: https://docs.mosaicml.com/projects/composer/en/stable/examples/early_stopping.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Come get involved with MosaicML!\n",
    "\n",
    "We'd love for you to get involved with the MosaicML community in any of these ways:\n",
    "\n",
    "### [Star Composer on GitHub](https://github.com/mosaicml/composer)\n",
    "\n",
    "Help make others aware of our work by [starring Composer on GitHub](https://github.com/mosaicml/composer).\n",
    "\n",
    "### [Join the MosaicML Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg)\n",
    "\n",
    "Head on over to the [MosaicML slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg) to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!\n",
    "\n",
    "### Contribute to Composer\n",
    "\n",
    "Is there a bug you noticed or a feature you'd like? File an [issue](https://github.com/mosaicml/composer/issues) or make a [pull request](https://github.com/mosaicml/composer/pulls)!"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
