{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine-tune HF Model with Your Custom Training Loop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append(\"../\")\n",
    "\n",
    "from tqdm.auto import tqdm\n",
    "import torch\n",
    "from transformers import AutoTokenizer\n",
    "from zo2 import (\n",
    "    ZOConfig,\n",
    "    zo_hf_init,\n",
    ")\n",
    "from zo2.utils import seed_everything"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hyperparameter\n",
    "zo_method = \"zo2\"\n",
    "eval_mode = False\n",
    "model_name = \"facebook/opt-2.7b\"\n",
    "verbose = True\n",
    "max_steps = 100\n",
    "learning_rate = 1e-5\n",
    "weight_decay = 1e-1\n",
    "zo_eps = 1e-3\n",
    "seed = 42\n",
    "offloading_device = \"cpu\"\n",
    "working_device = \"cuda:0\"\n",
    "use_cache = True\n",
    "max_new_tokens = 50\n",
    "temperature = 1.0\n",
    "seed_everything(seed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ZO steps\n",
    "zo_config = ZOConfig(\n",
    "    method=\"mezo-sgd\", \n",
    "    zo2=zo_method==\"zo2\", \n",
    "    lr=learning_rate,\n",
    "    weight_decay=weight_decay,\n",
    "    eps=zo_eps,\n",
    "    offloading_device=offloading_device,\n",
    "    working_device=working_device,\n",
    ")\n",
    "\n",
    "# Load ZO model\n",
    "with zo_hf_init(zo_config):\n",
    "    from transformers import OPTForCausalLM\n",
    "    model = OPTForCausalLM.from_pretrained(model_name)\n",
    "    model.zo_init(zo_config)\n",
    "if zo_method != \"zo2\": \n",
    "    model = model.to(working_device)\n",
    "print(f\"Check if zo2 init correctly: {hasattr(model, 'zo_training')}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare some data\n",
    "dataset = \"\"\"\n",
    "    What is ZO2? \n",
    "    ZO2 is an innovative framework specifically designed to enhance the fine-tuning of large language models (LLMs) using zeroth-order (ZO) optimization techniques and advanced offloading technologies. \n",
    "    This framework is particularly tailored for setups with limited GPU memory, enabling the fine-tuning of models that were previously unmanageable due to hardware constraints. \n",
    "    As the scale of Large Language Models (LLMs) continues to grow, reaching parameter counts in the hundreds of billions, managing GPU memory resources effectively becomes crucial. \n",
    "    Efficient GPU memory management is crucial not only because it directly influences model performance and training speed, but also because GPU memory is both expensive and limited in quantity. \n",
    "    However, this creates a significant challenge in handling ever-larger models within the physical constraints of current hardware technologies. \n",
    "    CPU offloading has become a crucial technique for overcoming this challenge. \n",
    "    It involves transferring computations and data from the GPU to the CPU, specifically targeting data or parameters that are less frequently accessed. \n",
    "    By offloading these inactive tensors of the neural network, CPU offloading effectively alleviates the memory and computational pressures on GPUs. \n",
    "    While CPU offloading has been commonly applied in inference to manage memory-intensive tasks, its application in training, especially fine-tuning, remains less explored. \n",
    "    Recently, some works have tried to introduce CPU offloading into LLM training. \n",
    "    However, they are typically constrained by the capabilities of first-order optimizers such as SGD and Adaptive Moment Estimation (AdamW), and limited GPU memory, restricting large-scale model scalability on single GPU systems. \n",
    "    Using first-order optimizers introduces inefficiencies in CPU offloading: Multiple communication operations during the training of LLMs necessitate offloading the same data twice—once for each pass. \n",
    "    This redundancy not only doubles the communication volume between the CPU and GPU but also introduces significant latency due to repetitive data transfers. \n",
    "    Furthermore, both parameters and activations are required in the backward pass to complete gradient computations. \n",
    "    This means that parameters and activation values must be offloaded during each forward pass and re-uploaded to the GPU for the backward pass, increasing the volume of data transferred, which severely impacts training throughput. \n",
    "    On the other hand, zeroth-order (ZO) methods offer a novel approach to fine-tuning LLMs. \n",
    "    These methods utilize dual forward passes to estimate parameter gradients and subsequently update parameters. \n",
    "    This approach eliminates the traditional reliance on backward passes, thereby streamlining the training process by significantly reducing the number of computational steps required. \n",
    "    Based on these observations, we conjecture that ZO's architecture is particularly well-suited for CPU offloading strategies. \n",
    "    By eliminating backward passes and the need to store activation values, it can significantly reduce GPU memory demands through efficient parameter offloading. \n",
    "    However, despite these advantages, ZO training via CPU offloading introduces new challenges, particularly in the realm of CPU-to-GPU communication. \n",
    "    Transferring parameters between the CPU and GPU, which is crucial for maintaining gradient computation and model updates, becomes a critical bottleneck. \n",
    "    Although ZO methods inherently extend computation times because of the dual forward passes, potentially allowing for better overlap between computation and communication, there remain significant inefficiencies. \n",
    "    The necessity to upload parameters to the GPU for upcoming computations introduces a large volume of communications. To tackle the inefficiencies highlighted, we introduce ZO2, a novel framework specifically designed for ZO fine-tuning in LLMs with CPU offloading. \n",
    "    This framework utilizes the unique dual forward pass architecture of ZO methods to optimize interactions between CPU and GPU, significantly enhancing both computational and communication efficiency. \n",
    "    By building a high-performance dynamic scheduler, ZO2 achieves substantial overlaps in communication and computation. \n",
    "    These innovations make it feasible to fine-tune extremely large models, such as the OPT-175B, with over 175 billion parameters, on a single GPU equipped with just 18GB of memory usage—a capability previously unattainable with conventional methods. \n",
    "    Additionally, our efficient framework operates without any extra time cost and decreases in accuracy compared to standard ZO methodologies.\"\"\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "data_batch = tokenizer(dataset, add_special_tokens=True, return_tensors='pt').input_ids.to(working_device)\n",
    "T = min(data_batch.shape[1] - 1, model.config.max_position_embeddings)\n",
    "print(f\"Fine-tuning model {model_name} with {T} tokens dataset: \\n{dataset}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Training loop\n",
    "for i in tqdm(range(max_steps)):\n",
    "    model.zo_train()\n",
    "    loss = model(input_ids=data_batch, labels=data_batch)\n",
    "\n",
    "    # eval\n",
    "    if eval_mode:\n",
    "        if i==0:\n",
    "            tqdm.write(\"Warning: please notice that ZO2 does not optimize the evaluation, so it may be very slow.\")\n",
    "        model.zo_eval()\n",
    "        output = model(input_ids=data_batch, labels=data_batch)\n",
    "        res = \"Iteration {}, train loss: {}, projected grad: {}, eval loss: {}\"\n",
    "        tqdm.write(res.format(i, loss, model.opt.projected_grad, output[\"loss\"]))\n",
    "    else:\n",
    "        res = \"Iteration {}, train loss: {}, projected grad: {}\"\n",
    "        tqdm.write(res.format(i, loss, model.opt.projected_grad))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# inference\n",
    "print(\"Doing inference...\")\n",
    "print(\"Warning: please notice that ZO2 does not optimize the inference, so it may be very slow.\")\n",
    "model.zo_eval()\n",
    "prompt = \"What is ZO2 and how ZO2 enhance the fine-tuning of large language models?\"\n",
    "inputs = tokenizer(prompt, return_tensors='pt').to(working_device)\n",
    "inputs = {\"input_ids\": inputs.input_ids}\n",
    "for _ in tqdm(range(max_new_tokens)):\n",
    "    outputs = model(**inputs, return_dict=True)\n",
    "    next_token_logits = outputs.logits[:, -1, :]\n",
    "    if temperature == 1.0:\n",
    "        next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)\n",
    "    else:\n",
    "        scaled_logits = next_token_logits / temperature\n",
    "        probs = torch.nn.functional.softmax(scaled_logits, dim=-1)\n",
    "        next_token = torch.multinomial(probs, num_samples=1)\n",
    "    inputs = torch.cat([inputs[\"input_ids\"], next_token], dim=-1)\n",
    "    generated_text = tokenizer.decode(inputs[0])\n",
    "    inputs = {\"input_ids\": inputs}\n",
    "print(f\"Question: {prompt}\")\n",
    "print(f\"Response: {generated_text[len(prompt)+4:]}...\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mezo",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
