{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🤗 Pretraining and Finetuning with Hugging Face Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Want to pretrain and finetune a Hugging Face model with Composer? No problem. Here, we'll walk through using Composer to pretrain and finetune a Hugging Face model.\n",
    "\n",
    "### Recommended Background\n",
    "\n",
    "If you have already gone through our [tutorial][huggingface] on finetuning a pretrained Hugging Face model with Composer, many parts of this tutorial will be familiar to you, but it is not necessary to do that one first.\n",
    "\n",
    "This tutorial assumes you are familiar with transformer models for NLP and with Hugging Face.\n",
    "\n",
    "To better understand the Composer part, make sure you're comfortable with the material in our [Getting Started][getting_started] tutorial.\n",
    "\n",
    "### Tutorial Goals and Concepts Covered\n",
    "\n",
    "The goal of this tutorial is to demonstrate how to pretrain and finetune a Hugging Face transformer using the Composer library!\n",
    "\n",
    "Inspired by [this paper][downstream] showing that performing unsupervised pretraining on the downstream dataset can be surprisingly effective, we will focus on pretraining and finetuning a small version of [Electra][electra] on the [AG News][agnews] dataset!\n",
    "\n",
    "Along the way, we will touch on:\n",
    "\n",
    "* Creating our Hugging Face model, tokenizer, and data loaders\n",
    "* Wrapping the Hugging Face model as a `ComposerModel` for use with the Composer trainer\n",
    "* Reloading the pretrained model with a new head for sequence classification\n",
    "* Training with Composer\n",
    "\n",
    "Let's do this 🚀\n",
    "\n",
    "[huggingface]: https://docs.mosaicml.com/projects/composer/en/stable/examples/huggingface_models.html\n",
    "[getting_started]: https://docs.mosaicml.com/projects/composer/en/stable/examples/getting_started.html\n",
    "[downstream]: https://arxiv.org/abs/2209.14389\n",
    "[agnews]: https://paperswithcode.com/sota/text-classification-on-ag-news\n",
    "[electra]: https://arxiv.org/abs/2003.10555"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install Composer\n",
    "\n",
    "To use Hugging Face with Composer, we'll need to install Composer *with the NLP dependencies*. If you haven't already, run: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install 'mosaicml[nlp]'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Hugging Face Model\n",
    "First, we import an Electra model and its associated tokenizer from the transformers library. We use Electra small in this notebook so that our model trains quickly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import transformers\n",
    "from composer.utils import reproducibility\n",
    "\n",
    "# Create an Electra masked language modeling model using Hugging Face transformers\n",
    "# Note: this is just loading the model architecture, and is using randomly initialized weights, so it is important to set\n",
    "# the random seed here\n",
    "reproducibility.seed_all(17)\n",
    "config = transformers.AutoConfig.from_pretrained('google/electra-small-discriminator')\n",
    "model = transformers.AutoModelForMaskedLM.from_config(config)\n",
    "tokenizer = transformers.AutoTokenizer.from_pretrained('google/electra-small-discriminator') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Dataloaders"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the purpose of this tutorial, we are going to perform unsupervised pretraining (masked language modeling) on our downstream dataset, AG News. We are only going to train for one epoch here, but note that the [paper][downstream] that showed good performance from pretraining on the downstream dataset trained for much longer.\n",
    "\n",
    "[downstream]: https://arxiv.org/abs/2209.14389"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datasets\n",
    "from torch.utils.data import DataLoader\n",
    "\n",
    "# Load the AG News dataset from Hugging Face\n",
    "agnews_dataset = datasets.load_dataset('ag_news')\n",
    "\n",
    "# Split the dataset randomly into a train and eval set\n",
    "split_dict = agnews_dataset['train'].train_test_split(test_size=0.2, shuffle=True, seed=17)\n",
    "train_dataset = split_dict['train']\n",
    "eval_dataset = split_dict['test']\n",
    "\n",
    "text_column_name = 'text'\n",
    "\n",
    "# Tokenize the datasets\n",
    "def tokenize_function(examples):\n",
    "    # Remove empty lines\n",
    "    examples[text_column_name] = [\n",
    "        line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()\n",
    "    ]\n",
    "    return tokenizer(\n",
    "        examples[text_column_name],\n",
    "        padding='max_length',\n",
    "        truncation=True,\n",
    "        max_length=256,\n",
    "        return_special_tokens_mask=True,\n",
    "    )\n",
    "\n",
    "tokenized_train = train_dataset.map(\n",
    "    tokenize_function,\n",
    "    batched=True,\n",
    "    remove_columns=[text_column_name, 'label'],\n",
    "    load_from_cache_file=False,\n",
    ")\n",
    "tokenized_eval = eval_dataset.map(\n",
    "    tokenize_function,\n",
    "    batched=True,\n",
    "    remove_columns=[text_column_name, 'label'],\n",
    "    load_from_cache_file=False,\n",
    ")\n",
    "\n",
    "# We use the language modeling data collator from Hugging Face which will handle preparing the inputs correctly\n",
    "# for masked language modeling\n",
    "collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)\n",
    "\n",
    "# Create the dataloaders\n",
    "train_dataloader = DataLoader(tokenized_train, batch_size=64, collate_fn=collator)\n",
    "eval_dataloader = DataLoader(tokenized_eval, batch_size=64, collate_fn=collator)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert model to `ComposerModel`\n",
    "\n",
    "Composer uses `HuggingFaceModel` as a convenient interface for wrapping a Hugging Face model (such as the one we created above) in a `ComposerModel`. Its parameters are:\n",
    "\n",
    "- `model`: The Hugging Face model to wrap.\n",
    "- `tokenizer`: The Hugging Face tokenizer used to create the input data\n",
    "- `metrics`: A list of torchmetrics to apply to the output of `eval_forward` (a `ComposerModel` method).\n",
    "- `use_logits`: A boolean which, if True, flags that the model's output logits should be used to calculate validation metrics.\n",
    "\n",
    "See the [API Reference][api] for additional details.\n",
    "\n",
    "[api]: https://docs.mosaicml.com/projects/composer/en/stable/api_reference/generated/composer.models.HuggingFaceModel.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from composer.metrics.nlp import LanguageCrossEntropy, MaskedAccuracy\n",
    "from composer.models.huggingface import HuggingFaceModel\n",
    "\n",
    "metrics = [\n",
    "    LanguageCrossEntropy(ignore_index=-100),\n",
    "    MaskedAccuracy(ignore_index=-100)\n",
    "]\n",
    "# Package as a trainer-friendly Composer model\n",
    "composer_model = HuggingFaceModel(model, tokenizer=tokenizer, metrics=metrics, use_logits=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimizers and Learning Rate Schedulers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last setup step is to create an optimizer and a learning rate scheduler. We will use Composer's [DecoupledAdamW][optimizer] optimizer and [LinearWithWarmupScheduler][scheduler].\n",
    "\n",
    "[optimizer]: https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.optim.DecoupledAdamW.html\n",
    "[scheduler]: https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.optim.LinearWithWarmupScheduler.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from composer.optim import DecoupledAdamW, LinearWithWarmupScheduler\n",
    "\n",
    "optimizer = DecoupledAdamW(composer_model.parameters(), lr=1.0e-4, betas=[0.9, 0.98], eps=1.0e-06, weight_decay=1.0e-5)\n",
    "lr_scheduler = LinearWithWarmupScheduler(t_warmup='250ba', alpha_f=0.02)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Composer Trainer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now specify a Composer `Trainer` object and run our training! `Trainer` has many arguments that are described in our [documentation](https://docs.mosaicml.com/projects/composer/en/stable/api_reference/generated/composer.Trainer.html#trainer), so we'll discuss only the less-obvious arguments used below:\n",
    "\n",
    "- `max_duration` - a string specifying how long to train. This can be in terms of batches (e.g., `'10ba'` is 10 batches) or epochs (e.g., `'1ep'` is 1 epoch), [among other options][time].\n",
    "- `save_folder` - a string specifying where to save checkpoints to\n",
    "- `schedulers` - a (list of) PyTorch or Composer learning rate scheduler(s) that will be composed together.\n",
    "- `device` - specifies if the training will be done on CPU or GPU by using `'cpu'` or `'gpu'`, respectively. You can omit this to automatically train on GPUs if they're available and fall back to the CPU if not.\n",
    "- `train_subset_num_batches` - specifies the number of training batches to use for each epoch. This is not a necessary argument but is useful for quickly testing code.\n",
    "- `precision` - whether to do the training in full precision (`'fp32'`) or mixed precision (`'amp_fp16'` or `'amp_bf16'`). Mixed precision can provide a ~2x training speedup on recent NVIDIA GPUs.\n",
    "- `seed` - sets the random seed for the training run, so the results are reproducible!\n",
    "\n",
    "[time]: https://docs.mosaicml.com/projects/composer/en/stable/trainer/time.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from composer import Trainer\n",
    "\n",
    "# Create Trainer Object\n",
    "trainer = Trainer(\n",
    "    model=composer_model, # This is the model from the HuggingFaceModel wrapper class.\n",
    "    train_dataloader=train_dataloader,\n",
    "    eval_dataloader=eval_dataloader,\n",
    "    max_duration='1ep', # train for more epochs to get better performance\n",
    "    save_folder='checkpoints/pretraining/',\n",
    "    optimizers=optimizer,\n",
    "    schedulers=[lr_scheduler],\n",
    "    device='gpu' if torch.cuda.is_available() else 'cpu',\n",
    "    # train_subset_num_batches=100, # uncomment this line to only run part of training, which will be faster\n",
    "    precision='fp32',\n",
    "    seed=17,\n",
    ")\n",
    "# Start training\n",
    "trainer.fit()\n",
    "trainer.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the pretrained model for finetuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a pretrained Hugging Face model, we will load it in and finetune it on a sequence classification task. Composer provides utilities to easily reload a Hugging Face model and tokenizer from a composer checkpoint, and add a task specific head to the model so that it can be finetuned for a new task"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchmetrics.classification import MulticlassAccuracy\n",
    "from composer.metrics import CrossEntropy\n",
    "from composer.models import HuggingFaceModel\n",
    "\n",
    "# Note: this does not load the weights, just the right model/tokenizer class and config.\n",
    "# The weights will be loaded by the Composer trainer\n",
    "model, tokenizer = HuggingFaceModel.hf_from_composer_checkpoint(\n",
    "    f'checkpoints/pretraining/latest-rank0.pt',\n",
    "    model_instantiation_class='transformers.AutoModelForSequenceClassification',\n",
    "    model_config_kwargs={'num_labels': 4})\n",
    "\n",
    "metrics = [CrossEntropy(), MulticlassAccuracy(num_classes=4, average='micro')]\n",
    "composer_model = HuggingFaceModel(model, tokenizer=tokenizer, metrics=metrics, use_logits=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next part should look very familiar if you have already gone through the [tutorial][huggingface], as it is exactly the same except using a different dataset and starting model!\n",
    "\n",
    "[huggingface]: https://docs.mosaicml.com/projects/composer/en/stable/examples/huggingface_models.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now finetune on the AG News dataset. We have already downloaded and split the dataset, so now we just need to prepare the dataset for finetuning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datasets\n",
    "\n",
    "text_column_name = 'text'\n",
    "\n",
    "def tokenize_function(sample):\n",
    "    return tokenizer(\n",
    "        text=sample[text_column_name],\n",
    "        padding=\"max_length\",\n",
    "        max_length=256,\n",
    "        truncation=True\n",
    "    )\n",
    "\n",
    "tokenized_train = train_dataset.map(\n",
    "    tokenize_function,\n",
    "    batched=True,\n",
    "    remove_columns=['text'],\n",
    "    load_from_cache_file=False,\n",
    ")\n",
    "tokenized_eval = eval_dataset.map(\n",
    "    tokenize_function,\n",
    "    batched=True,\n",
    "    remove_columns=['text'],\n",
    "    load_from_cache_file=False,\n",
    ")\n",
    "\n",
    "from torch.utils.data import DataLoader\n",
    "data_collator = transformers.data.data_collator.default_data_collator\n",
    "train_dataloader = DataLoader(tokenized_train, batch_size=32, shuffle=False, drop_last=False, collate_fn=data_collator)\n",
    "eval_dataloader = DataLoader(tokenized_eval, batch_size=32, shuffle=False, drop_last=False, collate_fn=data_collator)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we will create our optimizer and learning rate scheduler for the finetuning task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from composer.optim import DecoupledAdamW, LinearWithWarmupScheduler\n",
    "\n",
    "optimizer = DecoupledAdamW(composer_model.parameters(), lr=1.0e-4, betas=[0.9, 0.98], eps=1.0e-06, weight_decay=3.0e-4)\n",
    "lr_scheduler = LinearWithWarmupScheduler(t_warmup='0.06dur', alpha_f=0.02)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly we can make our finetuning trainer and train! The only new arguments to the trainer here are `load_path`, which tells Composer where to load the already trained weights from, and `load_weights_only`, which tells Composer that we only want to load the weights from the checkpoint, not any other state from the previous training run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from composer import Trainer\n",
    "\n",
    "# Create Trainer Object\n",
    "trainer = Trainer(\n",
    "    model=composer_model, # This is the model from the HuggingFaceModel wrapper class.\n",
    "    train_dataloader=train_dataloader,\n",
    "    eval_dataloader=eval_dataloader,\n",
    "    max_duration='1ep', # Again, training for more epochs is likely to lead to higher performance\n",
    "    save_folder='checkpoints/finetuning/',\n",
    "    load_path=f'checkpoints/pretraining/latest-rank0.pt',\n",
    "    load_weights_only=True, # We're starting a new training run, so we just the model weights\n",
    "    load_strict_model_weights=False, # We're going from the original model, which is for MaskedLM, to a new model, for SequenceClassification\n",
    "    optimizers=optimizer,\n",
    "    schedulers=[lr_scheduler],\n",
    "    device='gpu' if torch.cuda.is_available() else 'cpu',\n",
    "    precision='fp32',\n",
    "    seed=17,\n",
    ")\n",
    "# Start training\n",
    "trainer.fit()\n",
    "trainer.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not bad, we got up to 91.5% accuracy on our eval split! Note that this is considerably less than the state-of-the-art on this task, but we started from a randomly initialized model, and did not train for very long, either in pretraining or finetuning!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many possibilities for how to improve performance. Using a larger model and training for longer are often the first thing to try to improve performance (given a fixed dataset). You can also tweak the hyperparameters, try a different model class, start from pretrained weights instead of randomly initialized, or try adding some of Composer's [algorithms][algorithms]. We encourage you to play around with these and other things to get familiar with training in Composer.\n",
    "\n",
    "[algorithms]: https://docs.mosaicml.com/projects/composer/en/stable/trainer/algorithms.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What next?\n",
    "\n",
    "You've now seen how to use the Composer `Trainer` to pretrain and finetune a Hugging Face model on the AG News dataset.\n",
    "\n",
    "If you want to keep learning more, try looking through some of the documents linked throughout this tutorial to see if you can form a deeper intuition for what's going on in these examples.\n",
    "\n",
    "In addition, please continue to explore our tutorials and examples! Here are a couple suggestions:\n",
    "\n",
    "* Explore more advanced applications of Composer like [applying image segmentation to medical images][image_segmentation_tutorial].\n",
    "\n",
    "* Learn about callbacks and how to apply [early stopping][early_stopping_tutorial].\n",
    "\n",
    "* Check out the [benchmarks][benchmarks] repo for full examples of training large language models like GPT and BERT, image segmentation models like DeepLab, and more!\n",
    "\n",
    "[benchmarks]: https://github.com/mosaicml/benchmarks\n",
    "[image_segmentation_tutorial]: https://docs.mosaicml.com/projects/composer/en/stable/examples/medical_image_segmentation.html\n",
    "[early_stopping_tutorial]: https://docs.mosaicml.com/projects/composer/en/stable/examples/early_stopping.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Come get involved with MosaicML!\n",
    "\n",
    "We'd love for you to get involved with the MosaicML community in any of these ways:\n",
    "\n",
    "### [Star Composer on GitHub](https://github.com/mosaicml/composer)\n",
    "\n",
    "Help make others aware of our work by [starring Composer on GitHub](https://github.com/mosaicml/composer).\n",
    "\n",
    "### [Join the MosaicML Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg)\n",
    "\n",
    "Head on over to the [MosaicML slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg) to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!\n",
    "\n",
    "### Contribute to Composer\n",
    "\n",
    "Is there a bug you noticed or a feature you'd like? File an [issue](https://github.com/mosaicml/composer/issues) or make a [pull request](https://github.com/mosaicml/composer/pulls)!"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
