{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-J8iGzLf4rUJ"
      },
      "source": [
        "# GRPO EssentialAI/rnj-1-instruct with QLoRA using TRL\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_rnj_1_instruct.ipynb)\n",
        "\n",
        "![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)\n",
        "\n",
        "\n",
        "With [**Transformers Reinforcement Learning (TRL)**](https://github.com/huggingface/trl), you can fine-tune cutting edge large language models. It comes with support for quantized parameter efficient fine-tuning technique **QLoRA**, so we can use Colab to fine-tune models like [EssentialAI/rnj-1-instruct](https://huggingface.co/collections/EssentialAI/rnj-1).\n",
        "\n",
        "\n",
        "- [TRL GitHub Repository](https://github.com/huggingface/trl) — star us to support the project!  \n",
        "- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview)  \n",
        "- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)\n",
        "\n",
        "In this notebook, we'll add reasoning capabilities to the model, teaching it to generate reasoning traces (`<think></think>`) before giving us the final answer (`<answer></answer>`)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NvrzGRnu48Vz"
      },
      "source": [
        "## Install dependencies\n",
        "\n",
        "We'll install **TRL** with the **PEFT** extra, which ensures all main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install **trackio** to log and monitor our experiments, and **bitsandbytes** to enable quantization of LLMs, reducing memory consumption for both inference and training."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8VOdRz9fgFa8"
      },
      "outputs": [],
      "source": [
        "!pip install -Uq \"trl[peft]\" bitsandbytes trackio math_verify"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gpzI6omi7728"
      },
      "source": [
        "### Log in to Hugging Face\n",
        "\n",
        "Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "d3j3BsdQgFa8"
      },
      "outputs": [],
      "source": [
        "from huggingface_hub import notebook_login\n",
        "\n",
        "notebook_login()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "V_Zylc4t79-n"
      },
      "source": [
        "## Load dataset\n",
        "\n",
        "\n",
        "We'll load the [**AI-MO/NuminaMath-TIR**](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset from the Hugging Face Hub using the `datasets` library.\n",
        "\n",
        "This dataset contains maths problems, along with the solution in thinking format specially tailored for LLMs. By training our model with this dataset, it'll improve its maths and thinking reasoning.\n",
        "\n",
        "> We only use a subset for educational purposes. In a real scenario, we'd use the complete dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "YSuLNZAmgFa9"
      },
      "outputs": [],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "dataset_id = 'AI-MO/NuminaMath-TIR'\n",
        "train_dataset = load_dataset(dataset_id, split='train[:5%]')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gVV7RoRN8zk5"
      },
      "source": [
        "In addition to the current columns, we also include a custom system prompt to tell the model how we'd like the generation.\n",
        "\n",
        "This system prompt is an adapted version of the original one extracted from **DeepSeek R1**. For additional background, see [this previous recipe](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl). We extend the prompt with **examples** and a **more explicit, verbose formulation** to make the desired behavior easier for the model to learn. Depending on your goals, you may further enrich the prompt to simplify learning, or intentionally shorten and harden it to encourage more robust and generalizable behavior.\n",
        "\n",
        "We convert the dataset samples into conversation samples, including the system prompt and problem description per sample, since this is how the GRPO trainer expects them.\n",
        "\n",
        "We also set `padding_side=\"left\"` to ensure that generated completions during training are concatenated directly after the prompt, which is essential for GRPO to correctly compare token-level probabilities between preferred and rejected responses."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vr9t-9Z5gFa9"
      },
      "outputs": [],
      "source": [
        "SYSTEM_PROMPT = \"\"\"A conversation between User and Assistant. The user asks a question, and the Assistant solves it.\n",
        "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\n",
        "The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.\n",
        "Use exactly one <think>...</think> block followed by exactly one <answer>...</answer> block.\n",
        "\n",
        "Examples:\n",
        "\n",
        "User: What is 2 + 2?\n",
        "Assistant:\n",
        "<think>\n",
        "I will add 2 and 2 together.\n",
        "</think>\n",
        "<answer>4</answer>\n",
        "\n",
        "User: What is 3 × 5?\n",
        "Assistant:\n",
        "<think>\n",
        "I will multiply 3 by 5.\n",
        "</think>\n",
        "<answer>15</answer>\n",
        "\n",
        "User: Find the GCD of 12 and 18.\n",
        "Assistant:\n",
        "<think>\n",
        "I will list the divisors of 12 and 18 and find the greatest one they have in common.\n",
        "</think>\n",
        "<answer>6</answer>\n",
        "\"\"\"\n",
        "\n",
        "def make_conversation(example):\n",
        "    return {\n",
        "        \"prompt\": [\n",
        "            {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
        "            {\"role\": \"user\", \"content\": example[\"problem\"]},\n",
        "        ],\n",
        "    }\n",
        "\n",
        "train_dataset = train_dataset.map(make_conversation)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5txAuMAa8ock"
      },
      "source": [
        "Let's review one example to understand the internal structure:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jZtkB0D9gFa9"
      },
      "outputs": [],
      "source": [
        "print(train_dataset[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FtdKjmyFZImL"
      },
      "source": [
        "And remove the columns that are not needed for training:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ai4F1GaPgFa-"
      },
      "outputs": [],
      "source": [
        "train_dataset = train_dataset.remove_columns(['messages', 'problem'])\n",
        "print(train_dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YY3uMp909Eqy"
      },
      "source": [
        "## Load model and configure LoRA/QLoRA\n",
        "\n",
        "This notebook can be used with two fine-tuning methods. By default, it is set up for **QLoRA**, which includes quantization using `BitsAndBytesConfig`. If you prefer to use standard **LoRA** without quantization, simply comment out the `BitsAndBytesConfig` configuration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "DSKcUQ9RgFa-"
      },
      "outputs": [],
      "source": [
        "from transformers import AutoModelForCausalLM, BitsAndBytesConfig\n",
        "import torch\n",
        "\n",
        "model_name = \"EssentialAI/rnj-1-instruct\"\n",
        "\n",
        "model = AutoModelForCausalLM.from_pretrained(\n",
        "    model_name,\n",
        "    dtype=\"auto\",\n",
        "    device_map=\"auto\",\n",
        "    quantization_config=BitsAndBytesConfig(\n",
        "        load_in_4bit=True,\n",
        "        bnb_4bit_use_double_quant=True,\n",
        "        bnb_4bit_quant_type=\"nf4\",\n",
        "        bnb_4bit_compute_dtype=torch.float16\n",
        "    ),\n",
        ")"
      ]
    },
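    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To get a feel for why 4-bit quantization matters here, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. The helper `weight_memory_gib` is hypothetical, and the parameter count of roughly 8 billion is an assumption to check against the model card. The estimate ignores quantization constants, layers kept in higher precision, activations, and the LoRA adapter, so treat it as a rough lower bound rather than a measurement."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def weight_memory_gib(n_params, bits_per_param):\n",
        "    \"\"\"Approximate memory needed to store the weights alone, in GiB.\"\"\"\n",
        "    return n_params * bits_per_param / 8 / 1024**3\n",
        "\n",
        "n_params = 8e9  # Assumption: roughly 8 billion parameters; check the model card\n",
        "\n",
        "print(f\"16-bit weights: {weight_memory_gib(n_params, 16):.1f} GiB\")\n",
        "print(f\"4-bit weights:  {weight_memory_gib(n_params, 4):.1f} GiB\")"
      ]
    },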
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WZGf-GF09Gsc"
      },
      "source": [
        "The following cell defines LoRA (or QLoRA if needed). When training with LoRA/QLoRA, we use a **base model** (the one selected above) and, instead of modifying its original weights, we fine-tune a **LoRA adapter**, a lightweight layer that enables efficient and memory-friendly training. The **`target_modules`** specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nMMlDxJSgFa-"
      },
      "outputs": [],
      "source": [
        "from peft import LoraConfig\n",
        "\n",
        "# You may need to update `target_modules` depending on the architecture of your chosen model.\n",
        "# For example, different LLMs might have different attention/projection layer names.\n",
        "peft_config = LoraConfig(\n",
        "    r=32,\n",
        "    lora_alpha=32,\n",
        "    target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\",],\n",
        ")\n"
      ]
    },
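    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough illustration of why the adapter is lightweight: LoRA keeps a weight matrix of shape `d_out x d_in` frozen and learns two small matrices of rank `r`, which adds `r * (d_in + d_out)` trainable parameters. The sketch below uses hypothetical layer dimensions (4096 x 4096), not the actual shapes of the model loaded above, and `lora_param_count` is an illustrative helper rather than a PEFT function. With `lora_alpha=32` and `r=32`, standard LoRA scales the learned update by `lora_alpha / r = 1`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def lora_param_count(d_in, d_out, r):\n",
        "    \"\"\"Trainable parameters LoRA adds to one d_out x d_in linear layer.\"\"\"\n",
        "    # Matrix A has shape (r, d_in) and matrix B has shape (d_out, r)\n",
        "    return r * (d_in + d_out)\n",
        "\n",
        "d_in, d_out, r = 4096, 4096, 32  # Hypothetical layer size; r matches the LoRA config\n",
        "frozen = d_in * d_out\n",
        "added = lora_param_count(d_in, d_out, r)\n",
        "\n",
        "print(f\"Frozen weights in the layer: {frozen:,}\")\n",
        "print(f\"Trainable LoRA parameters:   {added:,} ({added / frozen:.2%} of the layer)\")"
      ]
    },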
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mDq4V6dN9MGk"
      },
      "source": [
        "## Train model\n",
        "\n",
        "We'll configure **GRPO** using `GRPOConfig`, keeping the parameters minimal so the training fits on a Colab instance. You can adjust these settings depending on the resources available. For full details on all available parameters, check the [TRL GRPOConfig documentation](https://huggingface.co/docs/trl/sft_trainer#trl.GRPOConfig).\n",
        "\n",
        "First, we need to define the rewards functions that the training algorithm will use to improve the model. In this case, we'll include just one reward function.\n",
        "We'll use a format reward that will reward the model when the output includes `<think>` and `<answer>` tags. This is a simplification of the pipeline for educational purposes, but in a real scenario, you'd at least all need a reward function to check the correctness of the model answer. The function has been extracted from [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py).\n",
        "\n",
        "> 💡 **Note**:  \n",
        "> You can further refine this reward by making it more granular. For example, assigning partial rewards when `<think>` and `<answer>` appear independently, or when they are present but incorrectly ordered. This can make the learning signal denser and speed up early training. However, overly simplifying the reward may reduce robustness, even if it helps the model converge faster. In practice, there is a trade-off between ease of learning and the generalization quality of the final model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Rtx5owCRgFa-"
      },
      "outputs": [],
      "source": [
        "import re\n",
        "\n",
        "def format_reward(completions, **kwargs):\n",
        "    \"\"\"Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags.\"\"\"\n",
        "    pattern = r\"<think>.*?</think>.*?<answer>.*?</answer>\"\n",
        "\n",
        "    matches = []\n",
        "    for item in completions:\n",
        "        if isinstance(item, list):\n",
        "            text = item[0]['content']\n",
        "        else:\n",
        "            text = item\n",
        "        match = re.match(pattern, text, re.DOTALL | re.MULTILINE)\n",
        "        matches.append(match)\n",
        "\n",
        "    return [1.0 if match else 0.0 for match in matches]"
      ]
    },
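    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The note about more granular rewards can be sketched as follows. This is a hypothetical variant, not part of TRL or open-r1: the function name and the partial-credit weights (0.25 per tag pair, plus 0.5 for the correct order) are assumptions chosen only for illustration. It is not used in the training run below; to try it, you would pass it in `reward_funcs` instead of, or alongside, `format_reward`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import re\n",
        "\n",
        "def granular_format_reward(completions, **kwargs):\n",
        "    \"\"\"Hypothetical partial-credit format reward (illustrative weights).\"\"\"\n",
        "    rewards = []\n",
        "    for item in completions:\n",
        "        # Accept both conversational and plain-text completions\n",
        "        text = item[0][\"content\"] if isinstance(item, list) else item\n",
        "        think = re.search(r\"<think>.*?</think>\", text, re.DOTALL)\n",
        "        answer = re.search(r\"<answer>.*?</answer>\", text, re.DOTALL)\n",
        "\n",
        "        score = 0.0\n",
        "        if think:\n",
        "            score += 0.25  # A complete <think> block is present\n",
        "        if answer:\n",
        "            score += 0.25  # A complete <answer> block is present\n",
        "        if think and answer and think.end() <= answer.start():\n",
        "            score += 0.5   # The reasoning comes before the answer\n",
        "        rewards.append(score)\n",
        "    return rewards\n",
        "\n",
        "samples = [\n",
        "    \"<think>add</think><answer>4</answer>\",  # correct format\n",
        "    \"<answer>4</answer><think>add</think>\",  # both tags, wrong order\n",
        "    \"<answer>4</answer>\",                    # answer only\n",
        "    \"4\",                                     # no tags\n",
        "]\n",
        "print(granular_format_reward(samples))  # [1.0, 0.5, 0.25, 0.0]"
      ]
    },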
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9xBL7Rni9LZb"
      },
      "source": [
        "After defining the reward function(s), we can define the `GRPOConfig`. You can adapt the values in the config depending on your training setting and even fit the training in more constrained setups like free Colab (T4)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rJ0VfG3wgFa-"
      },
      "outputs": [],
      "source": [
        "from trl import GRPOConfig\n",
        "\n",
        "output_dir = \"EssentialAI-rnj-1-instruct-trl-grpo\"\n",
        "\n",
        "# Configure training arguments using GRPOConfig\n",
        "training_args = GRPOConfig(\n",
        "    learning_rate=2e-5,                                   # Learning rate used during traing\n",
        "    num_train_epochs=1,                                   # Number of full dataset passes. For testing, use `max_steps` instead\n",
        "    #max_steps=100,\n",
        "\n",
        "    # Parameters that control the data preprocessing\n",
        "    per_device_train_batch_size=8,\n",
        "    max_completion_length=256, # default: 256             # Max completion length produced during training\n",
        "    num_generations=8, # default: 8                       # Number of generations produced during training for comparison\n",
        "    max_prompt_length=512,  # default: 512                # Max prompt length of the input prompt used for generation during training\n",
        "\n",
        "    # Parameters related to reporting and saving\n",
        "    output_dir=output_dir,                                # Where to save model checkpoints and logs\n",
        "    logging_steps=10,                                     # Log training metrics every N steps\n",
        "    report_to=\"trackio\",                                  # Experiment tracking tool\n",
        "    trackio_space_id = output_dir,                        # HF Space where you trackio will be\n",
        "\n",
        "    # Hub integration\n",
        "    push_to_hub=True,                                     # Push the resulted model to the Hub\n",
        "    log_completions=True,                                 # Log completions during training\n",
        ")"
      ]
    },
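    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see how GRPO uses the rewards of the `num_generations` completions sampled for one prompt, here is a simplified sketch of the group-relative advantage: each reward is centered on the group mean and scaled by the group standard deviation. This is a pure-Python illustration, not TRL's implementation; the helper name, the `eps` value, and the use of the population standard deviation are assumptions made for the example. Note that when every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason a denser reward can help early in training."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from statistics import mean, pstdev\n",
        "\n",
        "def group_relative_advantages(rewards, eps=1e-4):\n",
        "    \"\"\"Center and scale the rewards of one group of completions (simplified).\"\"\"\n",
        "    mu = mean(rewards)\n",
        "    sigma = pstdev(rewards)  # Assumption: population standard deviation\n",
        "    return [(r - mu) / (sigma + eps) for r in rewards]\n",
        "\n",
        "# Eight completions for one prompt; two of them follow the expected format\n",
        "mixed_group = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]\n",
        "print([round(a, 2) for a in group_relative_advantages(mixed_group)])\n",
        "\n",
        "# If no completion earns a reward, every advantage is zero\n",
        "flat_group = [0.0] * 8\n",
        "print(group_relative_advantages(flat_group))"
      ]
    },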
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "O0q3myQg927v"
      },
      "source": [
        "Configure the GRPO Trainer. We pass the previously configured `training_args`. We don't use eval dataset to maintain memory usage low but you can configure it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aW7Gi4nXgFa-"
      },
      "outputs": [],
      "source": [
        "from trl import GRPOTrainer\n",
        "\n",
        "trainer = GRPOTrainer(\n",
        "    model=model,\n",
        "    reward_funcs=[format_reward],\n",
        "    args=training_args,\n",
        "    train_dataset=train_dataset,\n",
        "    peft_config=peft_config,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kQC7Q5kg95xq"
      },
      "source": [
        "Show memory stats before training"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OJdVlC_mgFa_"
      },
      "outputs": [],
      "source": [
        "gpu_stats = torch.cuda.get_device_properties(0)\n",
        "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
        "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
        "\n",
        "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
        "print(f\"{start_gpu_memory} GB of memory reserved.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YazYtLAe97Dc"
      },
      "source": [
        "And train!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Mtv8s7rBgFa_"
      },
      "outputs": [],
      "source": [
        "trainer_stats = trainer.train()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SmcYN5yW99IP"
      },
      "source": [
        "Show memory stats after training"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-ROfX8e9gFa_"
      },
      "outputs": [],
      "source": [
        "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
        "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
        "used_percentage = round(used_memory / max_memory * 100, 3)\n",
        "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
        "\n",
        "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
        "print(f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\")\n",
        "print(f\"Peak reserved memory = {used_memory} GB.\")\n",
        "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
        "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
        "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "saarW87Y9_-R"
      },
      "source": [
        "## Saving fine tuned model\n",
        "\n",
        "In this step, we save the fine-tuned model both **locally** and to the **Hugging Face Hub** using the credentials from your account."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "09zYXJ3GgFa_"
      },
      "outputs": [],
      "source": [
        "trainer.save_model(output_dir)\n",
        "trainer.push_to_hub(dataset_name=dataset_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nfqvO0qw-OvS"
      },
      "source": [
        "## Load the fine-tuned model and run inference\n",
        "\n",
        "Now, let's test our fine-tuned model by loading the **LoRA/QLoRA adapter** and performing **inference**. We'll start by loading the **base model**, then attach the adapter to it, creating the final fine-tuned model ready for evaluation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9Yk9RAABgFa_"
      },
      "outputs": [],
      "source": [
        "output_dir = 'sergiopaniego/EssentialAI-rnj-1-instruct-trl-grpo'\n",
        "model_name = \"EssentialAI/rnj-1-instruct\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "CdzlQcCAgFa_"
      },
      "outputs": [],
      "source": [
        "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
        "from peft import PeftModel\n",
        "\n",
        "base_model = model_name\n",
        "adapter_model = f\"{output_dir}\" # Replace with your HF username or organization\n",
        "\n",
        "model = AutoModelForCausalLM.from_pretrained(base_model, dtype=\"auto\", device_map=\"auto\")\n",
        "model = PeftModel.from_pretrained(model, adapter_model)\n",
        "\n",
        "tokenizer = AutoTokenizer.from_pretrained(base_model)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LZgjlAu-gFa_"
      },
      "outputs": [],
      "source": [
        "train_dataset[0]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gjY6TqQHgFa_"
      },
      "outputs": [],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "dataset_id = 'AI-MO/NuminaMath-TIR'\n",
        "train_dataset = load_dataset(dataset_id, split='train[:5%]')\n",
        "\n",
        "problem = train_dataset[0]['problem']\n",
        "\n",
        "messages = [\n",
        "    {\n",
        "        \"role\": \"system\", \"content\": [\n",
        "            {\"type\": \"text\", \"text\": SYSTEM_PROMPT}\n",
        "        ]\n",
        "    },\n",
        "    {\n",
        "        \"role\": \"user\",\n",
        "        \"content\": [\n",
        "            {\"type\": \"text\", \"text\": problem},\n",
        "        ],\n",
        "    },\n",
        "]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "eaVubGYmgFa_"
      },
      "outputs": [],
      "source": [
        "messages"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2M6Xh4JMgFa_"
      },
      "outputs": [],
      "source": [
        "input_ids = tokenizer.apply_chat_template(\n",
        "    messages,\n",
        "    add_generation_prompt=True,\n",
        "    return_tensors=\"pt\"\n",
        ").to(model.device)\n",
        "\n",
        "# --- Generate Prediction --- #\n",
        "print(\"Generating prediction...\")\n",
        "output_ids = model.generate(\n",
        "    input_ids,\n",
        "    max_new_tokens=50,\n",
        "    pad_token_id=tokenizer.eos_token_id,\n",
        "    do_sample=True,\n",
        "    temperature=0.2,\n",
        "    top_p=0.95\n",
        ")\n",
        "\n",
        "response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)\n",
        "print(response)"
      ]
    }
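    ,
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since the model is trained to wrap its final answer in `<answer></answer>` tags, a small helper can pull that answer out of a generated response. The function `extract_answer` below is a hypothetical utility written for this illustration, and the sample string is made up rather than produced by the model; in practice you would pass the `response` generated above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import re\n",
        "\n",
        "def extract_answer(text):\n",
        "    \"\"\"Return the content of the first <answer>...</answer> block, or None.\"\"\"\n",
        "    match = re.search(r\"<answer>(.*?)</answer>\", text, re.DOTALL)\n",
        "    return match.group(1).strip() if match else None\n",
        "\n",
        "# Made-up response used only to demonstrate the helper\n",
        "sample_response = \"<think>\\nI will add 2 and 2 together.\\n</think>\\n<answer> 4 </answer>\"\n",
        "print(extract_answer(sample_response))"
      ]
    }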
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "A100",
      "provenance": []
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
