{
 "cells": [
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "# Fine-tune FLAN-T5 using `bitsandbytes`, `peft` & `transformers` 🤗 ",
   "id": "85d50b7184ee2eaa"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "In this notebook we will see how to properly use `peft` , `transformers` & `bitsandbytes` to fine-tune `flan-t5-large` in a google colab!\n",
    "\n",
    "We will finetune the model on [`financial_phrasebank`](https://huggingface.co/datasets/financial_phrasebank) dataset, that consists of pairs of text-labels to classify financial-related sentences, if they are either `positive`, `neutral` or `negative`.\n",
    "\n",
    "Note that you could use the same notebook to fine-tune `flan-t5-xl` as well, but you would need to shard the models first to avoid CPU RAM issues on Google Colab, check [these weights](https://huggingface.co/ybelkada/flan-t5-xl-sharded-bf16)."
   ],
   "id": "7457d155918fbf81"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Install requirements",
   "id": "72eb49be4472a33e"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "!pip install -q bitsandbytes datasets accelerate\n",
    "!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main"
   ],
   "id": "3ddb3a40a1a9fed8"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Import model and tokenizer",
   "id": "a0e34106559e65a4"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# Select CUDA device index\n",
    "import os\n",
    "import torch\n",
    "\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n",
    "\n",
    "from datasets import load_dataset\n",
    "from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig\n",
    "\n",
    "model_name = \"google/flan-t5-large\"\n",
    "\n",
    "model = AutoModelForSeq2SeqLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)"
   ],
   "id": "81bdc9377dad99ba"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Prepare model for training",
   "id": "97c32a4b1da55709"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "Some pre-processing needs to be done before training such an int8 model using `peft`, therefore let's import an utiliy function `prepare_model_for_kbit_training` that will: \n",
    "- Casts all the non `int8` modules to full precision (`fp32`) for stability\n",
    "- Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states\n",
    "- Enable gradient checkpointing for more memory-efficient training"
   ],
   "id": "3db22e470d838b99"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "from peft import prepare_model_for_kbit_training\n",
    "\n",
    "model = prepare_model_for_kbit_training(model)"
   ],
   "id": "f3eaed59738f0975"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Load your `PeftModel` \n",
    "\n",
    "Here we will use LoRA (Low-Rank Adaptators) to train our model"
   ],
   "id": "270e5d5ec81851e6"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "from peft import LoraConfig, get_peft_model, TaskType\n",
    "\n",
    "\n",
    "def print_trainable_parameters(model):\n",
    "    \"\"\"\n",
    "    Prints the number of trainable parameters in the model.\n",
    "    \"\"\"\n",
    "    trainable_params = 0\n",
    "    all_param = 0\n",
    "    for _, param in model.named_parameters():\n",
    "        all_param += param.numel()\n",
    "        if param.requires_grad:\n",
    "            trainable_params += param.numel()\n",
    "    print(\n",
    "        f\"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}\"\n",
    "    )\n",
    "\n",
    "\n",
    "lora_config = LoraConfig(\n",
    "    r=16, lora_alpha=32, target_modules=[\"q\", \"v\"], lora_dropout=0.05, bias=\"none\", task_type=\"SEQ_2_SEQ_LM\"\n",
    ")\n",
    "\n",
    "model = get_peft_model(model, lora_config)\n",
    "print_trainable_parameters(model)"
   ],
   "id": "e471e33079849777"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "As you can see, here we are only training 0.6% of the parameters of the model! This is a huge memory gain that will enable us to fine-tune the model without any memory issue.",
   "id": "f527d421218468e9"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Load and process data\n",
    "\n",
    "Here we will use [`financial_phrasebank`](https://huggingface.co/datasets/financial_phrasebank) dataset to fine-tune our model on sentiment classification on financial sentences. We will load the split `sentences_allagree`, which corresponds according to the model card to the split where there is a 100% annotator agreement."
   ],
   "id": "477123860c9bfa35"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# loading dataset\n",
    "dataset = load_dataset(\"financial_phrasebank\", \"sentences_allagree\")\n",
    "dataset = dataset[\"train\"].train_test_split(test_size=0.1)\n",
    "dataset[\"validation\"] = dataset[\"test\"]\n",
    "del dataset[\"test\"]\n",
    "\n",
    "classes = dataset[\"train\"].features[\"label\"].names\n",
    "dataset = dataset.map(\n",
    "    lambda x: {\"text_label\": [classes[label] for label in x[\"label\"]]},\n",
    "    batched=True,\n",
    "    num_proc=1,\n",
    ")"
   ],
   "id": "f525abf676d84dc3"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "Let's also apply some pre-processing of the input data, the labels needs to be pre-processed, the tokens corresponding to `pad_token_id` needs to be set to `-100` so that the `CrossEntropy` loss associated with the model will correctly ignore these tokens.",
   "id": "afb642bad9bd081e"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# data preprocessing\n",
    "text_column = \"sentence\"\n",
    "label_column = \"text_label\"\n",
    "max_length = 128\n",
    "\n",
    "\n",
    "def preprocess_function(examples):\n",
    "    inputs = examples[text_column]\n",
    "    targets = examples[label_column]\n",
    "    model_inputs = tokenizer(inputs, max_length=max_length, padding=\"max_length\", truncation=True, return_tensors=\"pt\")\n",
    "    labels = tokenizer(targets, max_length=3, padding=\"max_length\", truncation=True, return_tensors=\"pt\")\n",
    "    labels = labels[\"input_ids\"]\n",
    "    labels[labels == tokenizer.pad_token_id] = -100\n",
    "    model_inputs[\"labels\"] = labels\n",
    "    return model_inputs\n",
    "\n",
    "\n",
    "processed_datasets = dataset.map(\n",
    "    preprocess_function,\n",
    "    batched=True,\n",
    "    num_proc=1,\n",
    "    remove_columns=dataset[\"train\"].column_names,\n",
    "    load_from_cache_file=False,\n",
    "    desc=\"Running tokenizer on dataset\",\n",
    ")\n",
    "\n",
    "train_dataset = processed_datasets[\"train\"]\n",
    "eval_dataset = processed_datasets[\"validation\"]"
   ],
   "id": "465c4cfee0f7befd"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Train our model! \n",
    "\n",
    "Let's now train our model, run the cells below.\n",
    "Note that for T5 since some layers are kept in `float32` for stability purposes there is no need to call autocast on the trainer."
   ],
   "id": "c99f1f657a82a2db"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "from transformers import TrainingArguments, Trainer\n",
    "\n",
    "training_args = TrainingArguments(\n",
    "    \"temp\",\n",
    "    eval_strategy=\"epoch\",\n",
    "    learning_rate=1e-3,\n",
    "    gradient_accumulation_steps=1,\n",
    "    auto_find_batch_size=True,\n",
    "    num_train_epochs=1,\n",
    "    save_steps=100,\n",
    "    save_total_limit=8,\n",
    ")\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=train_dataset,\n",
    "    eval_dataset=eval_dataset,\n",
    ")\n",
    "model.config.use_cache = False  # silence the warnings. Please re-enable for inference!"
   ],
   "id": "194cee646ac587fc"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "trainer.train()",
   "id": "ac87d544d57f49e1"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Qualitatively test our model",
   "id": "3cee531a223a8693"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "Let's have a quick qualitative evaluation of the model, by taking a sample from the dataset that corresponds to a positive label. Run your generation similarly as you were running your model from `transformers`:",
   "id": "c0866c0b105293e6"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "model.eval()\n",
    "input_text = \"In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 .\"\n",
    "inputs = tokenizer(input_text, return_tensors=\"pt\")\n",
    "\n",
    "outputs = model.generate(input_ids=inputs[\"input_ids\"], max_new_tokens=10)\n",
    "\n",
    "print(\"input sentence: \", input_text)\n",
    "print(\" output prediction: \", tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))"
   ],
   "id": "1fd9e1c54d0cad50"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Share your adapters on 🤗 Hub",
   "id": "b049a7457e34ab1d"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "Once you have trained your adapter, you can easily share it on the Hub using the method `push_to_hub` . Note that only the adapter weights and config will be pushed",
   "id": "955c80773a07d504"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "from huggingface_hub import notebook_login\n",
    "\n",
    "notebook_login()"
   ],
   "id": "1b0cc625be0f5888"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "model.push_to_hub(\"ybelkada/flan-t5-large-financial-phrasebank-lora\", use_auth_token=True)",
   "id": "45ca9d72c27c5b5"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Load your adapter from the Hub",
   "id": "1d533cb66015a3b1"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "You can load the model together with the adapter with few lines of code! Check the snippet below to load the adapter from the Hub and run the example evaluation!",
   "id": "c300b1b726e1f734"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "import torch\n",
    "from peft import PeftModel, PeftConfig\n",
    "from transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n",
    "\n",
    "peft_model_id = \"ybelkada/flan-t5-large-financial-phrasebank-lora\"\n",
    "config = PeftConfig.from_pretrained(peft_model_id)\n",
    "\n",
    "model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, torch_dtype=\"auto\", device_map=\"auto\")\n",
    "tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)\n",
    "\n",
    "# Load the Lora model\n",
    "model = PeftModel.from_pretrained(model, peft_model_id)"
   ],
   "id": "fdd60d1ecc64bf95"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "model.eval()\n",
    "input_text = \"In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 .\"\n",
    "inputs = tokenizer(input_text, return_tensors=\"pt\")\n",
    "\n",
    "outputs = model.generate(input_ids=inputs[\"input_ids\"], max_new_tokens=10)\n",
    "\n",
    "print(\"input sentence: \", input_text)\n",
    "print(\" output prediction: \", tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))"
   ],
   "id": "cfa3285a4ad92f42"
  }
 ],
 "metadata": {},
 "nbformat": 5,
 "nbformat_minor": 9
}
