{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Temporary reading of pilot llava results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"llava-100-test.csv\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Batch inference with Gemma/PaliGemma with HF + GCP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. And, Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. This example showcases how to deploy any supported text-generation model, in this case [`google/gemma-7b-it`](https://huggingface.co/google/gemma-7b-it), from the Hugging Face Hub on Vertex AI using the TGI DLC available in Google Cloud Platform (GCP)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![`google/gemma-7b-it` in the Hugging Face Hub](./assets/model-in-hf-hub.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup / Configuration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, you need to install `gcloud` in your local machine, which is the command-line tool for Google Cloud, following the instructions at [Cloud SDK Documentation - Install the gcloud CLI](https://cloud.google.com/sdk/docs/install)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, you also need to install the `google-cloud-aiplatform` Python SDK, required to programmatically create the Vertex AI model, register it, acreate the endpoint, and deploy it on Vertex AI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade google-cloud-aiplatform"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, to ease the usage of the commands within this tutorial, you need to set the following environment variables for GCP:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%env PROJECT_ID=multimodal-representations\n",
    "%env LOCATION=us-central1\n",
    "#%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310:latest\n",
    "#%env CONTAINER_URI=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-llava-serve\n",
    "%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then you need to login into your GCP account and set the project ID to the one you want to use to register and deploy the models on Vertex AI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud auth login\n",
    "!gcloud config set project $PROJECT_ID"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you are logged in, you need to enable the necessary service APIs in GCP, such as the Vertex AI API, the Compute Engine API, and Google Container Registry related APIs.\n",
    "\n",
    "**Warning:** Make sure, manually, that these are disabled after running exps (even though we will explicitly write code to disable them)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud services enable aiplatform.googleapis.com\n",
    "!gcloud services enable compute.googleapis.com\n",
    "!gcloud services enable container.googleapis.com\n",
    "!gcloud services enable containerregistry.googleapis.com\n",
    "!gcloud services enable containerfilesystem.googleapis.com"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Register model on Vertex AI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once everything is set up, you can already initialize the Vertex AI session via the `google-cloud-aiplatform` Python SDK as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from google.cloud import aiplatform\n",
    "\n",
    "aiplatform.init(\n",
    "    project=os.getenv(\"PROJECT_ID\"),\n",
    "    location=os.getenv(\"LOCATION\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since Gemma models are gated, you need to login into your Hugging Face Hub account with a read-access token either fine-grained with access to the gated model, or just overall read-access to your account. More information on how to generate a read-only access token for the Hugging Face Hub in the instructions at <https://huggingface.co/docs/hub/en/security-tokens>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade --quiet huggingface_hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import interpreter_login\n",
    "\n",
    "interpreter_login()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then you can already \"upload\" the model i.e. register the model on Vertex AI. It is not an upload per se, since the model will be automatically downloaded from the Hugging Face Hub in the Hugging Face DLC for TGI on startup via the `MODEL_ID` environment variable, so what is uploaded is only the configuration, not the model weights.\n",
    "\n",
    "Before going into the code, let's quickly review the arguments provided to the `upload` method:\n",
    "\n",
    "* **`display_name`** is the name that will be shown in the Vertex AI Model Registry.\n",
    "\n",
    "* **`serving_container_image_uri`** is the location of the Hugging Face DLC for TGI that will be used for serving the model.\n",
    "\n",
    "* **`serving_container_environment_variables`** are the environment variables that will be used during the container runtime, so these are aligned with the environment variables defined by `text-generation-inference`, which are analog to the [`text-generation-launcher` arguments](https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher). Additionally, the Hugging Face DLCs for TGI also capture the `AIP_` environment variables from Vertex AI as in [Vertex AI Documentation - Custom container requirements for prediction](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements).\n",
    "\n",
    "    * `MODEL_ID` is the identifier of the model in the Hugging Face Hub. To explore all the supported models you can check <https://huggingface.co/models?sort=trending&other=text-generation-inference>.\n",
    "    * `NUM_SHARD` is the number of shards to use if you don't want to use all GPUs on a given machine e.g. if you have two GPUs but you just want to use one for TGI then `NUM_SHARD=1`, otherwise it matches the `CUDA_VISIBLE_DEVICES`.\n",
    "    * `MAX_INPUT_TOKENS` is the maximum allowed input length (expressed in number of tokens), the larger it is, the larger the prompt can be, but also more memory will be consumed.\n",
    "    * `MAX_TOTAL_TOKENS` is the most important value to set as it defines the \"memory budget\" of running clients requests, the larger this value, the larger amount each request will be in your RAM and the less effective batching can be.\n",
    "    * `MAX_BATCH_PREFILL_TOKENS` limits the number of tokens for the prefill operation, as it takes the most memory and is compute bound, it is interesting to limit the number of requests that can be sent.\n",
    "    * `HUGGING_FACE_HUB_TOKEN` is the Hugging Face Hub token, required as [`google/gemma-7b-it`](https://huggingface.co/google/gemma-7b-it) is a gated model.\n",
    "\n",
    "* (optional) **`serving_container_ports`** is the port where the Vertex AI endpoint will be exposed, by default 8080.\n",
    "\n",
    "For more information on the supported `aiplatform.Model.upload` arguments, check its Python reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_upload."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(os.getenv(\"CONTAINER_URI\"))\n",
    "#huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud auth configure-docker us-docker.pkg.dev"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import get_token\n",
    "\n",
    "HF_MODEL = \"google/paligemma-3b-mix-224\" # change this\n",
    "DISPLAY_NAME = \"paligemma\"\n",
    "#HF_MODEL = \"Salesforce/blip2-opt-2.7b\"\n",
    "#HF_MODEL = \"liuhaotian/llava-v1.5-13b\"\n",
    "#HF_MODEL = \"llava-hf/llava-v1.6-mistral-7b-hf\"\n",
    "#DISPLAY_NAME = \"llava-1.6\"\n",
    "\n",
    "\n",
    "model = aiplatform.Model.upload(\n",
    "    display_name=DISPLAY_NAME,\n",
    "    serving_container_image_uri=\"us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310\",\n",
    "    serving_container_environment_variables={\n",
    "        \"MODEL_ID\": HF_MODEL,\n",
    "        \"HUGGING_FACE_HUB_TOKEN\": get_token(),\n",
    "        \"HF_HUB_ENABLE_HF_TRANSFER\": \"1\",\n",
    "        \"NUM_SHARD\": \"1\",\n",
    "    },\n",
    "    serving_container_ports=[8080],\n",
    ")\n",
    "model.wait()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "model = aiplatform.Model.upload(\n",
    "    display_name=\"Llama-Vision-11B\",\n",
    "    serving_container_image_uri=os.getenv(\"CONTAINER_URI\"),\n",
    "    serving_container_environment_variables={\n",
    "        \"MODEL_ID\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
    "        \"NUM_SHARD\": \"2\",\n",
    "        \"MAX_INPUT_TOKENS\": \"512\",\n",
    "        \"MAX_TOTAL_TOKENS\": \"1024\",\n",
    "        \"MAX_BATCH_PREFILL_TOKENS\": \"1512\",\n",
    "        \"HF_HUB_ENABLE_HF_TRANSFER\": \"1\",\n",
    "        \"HUGGING_FACE_HUB_TOKEN\": get_token(),\n",
    "        \"MESSAGES_API_ENABLED\": \"true\",\n",
    "    },\n",
    "    serving_container_ports=[8080],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Model on Vertex AI Model Registry](./assets/vertex-ai-model.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy model on Vertex AI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the model is registered on Vertex AI, you need to define the endpoint that you want to deploy the model to, and then link the model deployment to that endpoint resource.\n",
    "\n",
    "To do so, you need to call the method `aiplatform.Endpoint.create` to create a new Vertex AI endpoint resource (which is not linked to a model or anything usable yet)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "endpoint = aiplatform.Endpoint.create(display_name=f\"{DISPLAY_NAME}-endpoint\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Vertex AI Endpoint created](./assets/vertex-ai-endpoint.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can deploy the registered model in an endpoint on Vertex AI.\n",
    "\n",
    "The `deploy` method will link the previously created endpoint resource with the model that contains the configuration of the serving container, and then, it will deploy the model on Vertex AI in the specified instance.\n",
    "\n",
    "Before going into the code, let's quickly review the arguments provided to the `deploy` method:\n",
    "\n",
    "- **`endpoint`** is the endpoint to deploy the model to, which is optional, and by default will be set to the model display name with the `_endpoint` suffix.\n",
    "- **`machine_type`**, **`accelerator_type`** and **`accelerator_count`** are arguments that define which instance to use, and additionally, the accelerator to use and the number of accelerators, respectively. The `machine_type` and the `accelerator_type` are tied together, so you will need to select an instance that supports the accelerator that you are using and vice-versa. More information about the different instances at [Compute Engine Documentation - GPU machine types](https://cloud.google.com/compute/docs/gpus), and about the `accelerator_type` naming at [Vertex AI Documentation - MachineSpec](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec).\n",
    "\n",
    "For more information on the supported `aiplatform.Model.deploy` arguments, you can check its Python reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_deploy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "deployed_model = model.deploy(\n",
    "    endpoint=endpoint,\n",
    "    machine_type=\"g2-standard-24\",\n",
    "    accelerator_type=\"NVIDIA_L4\",\n",
    "    accelerator_count=2,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "deployed_model.to_dict()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.predict_schemata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**WARNING**: _The Vertex AI endpoint deployment via the `deploy` method may take from 15 to 25 minutes._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Online predictions on Vertex AI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, you can run the online predictions on Vertex AI using the `predict` method, which will send the requests to the running endpoint in the `/predict` route specified within the container following Vertex AI I/O payload formatting.\n",
    "\n",
    "As you are serving a `text-generation` model, you will need to make sure that the chat template, if any, is applied correctly to the input conversation; meaning that `transformers` need to be installed so as to instantiate the `tokenizer` for [`google/gemma-7b-it`](https://huggingface.co/google/gemma-7b-it) and run the `apply_chat_template` method over the input conversation before sending the input within the payload to the Vertex AI endpoint.\n",
    "\n",
    "**Note:** The chat template might not be needed for Gemma-2b/the model PaliGemma is based on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade --quiet transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the installation is complete, the following snippet will apply the chat template to the conversation:\n",
    "\n",
    ">**This isn't needed for PaliGemma but might be needed for Gemma-2b/the LM it is based on.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Via Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Within the same session"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are willing to run the online prediction within the current session, you can send requests programmatically via the `aiplatform.Endpoint` (returned by the `aiplatform.Model.deploy` method) as in the following snippet:\n",
    "\n",
    "\n",
    "Here, we have listed code to supply images to the model via two different ways: (1) via urls, and (2) from local files. KM has extrapolated this syntax from the TGI docs on huggingface: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/visual_language_models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import base64\n",
    "import requests\n",
    "import io"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**URL Based**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png\"\n",
    "url = \"https://<anon>/test.png\"\n",
    "\n",
    "output = endpoint.predict(\n",
    "    instances=[\n",
    "        {\n",
    "            \"inputs\":f\"![]({url}) What is in this image?\\n\",\n",
    "            \"parameters\":{\"max_new_tokens\": 100, \"do_sample\": False}\n",
    "        }\n",
    "    ]\n",
    ")\n",
    "# > helmet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Locally**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "image_path = \"fgqa_hs/images/output.png\"\n",
    "image_path = \"/home/<anon>/Downloads/rabbit.png\"\n",
    "from PIL import Image\n",
    "\n",
    "# Open an image file\n",
    "with open(image_path, \"rb\") as f:\n",
    "\n",
    "    \n",
    "    image = base64.b64encode(f.read()).decode(\"utf-8\")\n",
    "\n",
    "    #image = f\"data:image/png;base64,{image}\"\n",
    "\n",
    "    output = deployed_model.predict(\n",
    "        instances=[\n",
    "            {\"prompt\": \"Describe this image in detail:\\n\",\"base64_image\":image}\n",
    "        ]\n",
    "    )\n",
    "#> space suit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# is different when image is passed from a local file vs. from an url\n",
    "# even when we use greedy decoding (I think?), strange!\n",
    "output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL_ID = \"liuhaotian/llava-v1.5-13b\"\n",
    "SERVE_DOCKER_URI = \"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-llava-serve\"\n",
    "\n",
    "def deploy_model(model_id: str) -> tuple[aiplatform.Model, aiplatform.Endpoint]:\n",
    "   \"\"\"Uploads and deploys the model to Vertex AI endpoint for prediction.\"\"\"\n",
    "   model_name = \"llava_1.5\"\n",
    "   endpoint = aiplatform.Endpoint.create(display_name=f\"{model_name}-endpoint\")\n",
    "   serving_env = {\n",
    "       \"MODEL_ID\": model_id,\n",
    "       \"TS_NUM_WORKERS\": 1,\n",
    "       \"PRECISION_MODE\": \"4bit\",\n",
    "       \"DEPLOY_SOURCE\": \"custom\",\n",
    "   }\n",
    "   # If the model_id is a GCS path, use artifact_uri to pass it to serving docker.\n",
    "   artifact_uri = model_id if model_id.startswith(\"gs://\") else None\n",
    "   model = aiplatform.Model.upload(\n",
    "       display_name=model_name,\n",
    "       serving_container_image_uri=SERVE_DOCKER_URI,\n",
    "       serving_container_ports=[7080],\n",
    "       serving_container_predict_route=\"/predictions/llava_serving\",\n",
    "       serving_container_health_route=\"/ping\",\n",
    "       serving_container_environment_variables=serving_env,\n",
    "       artifact_uri=artifact_uri,\n",
    "   )\n",
    "   deployed_model = model.deploy(\n",
    "       endpoint=endpoint,\n",
    "       machine_type=\"g2-standard-24\",\n",
    "       accelerator_type=\"NVIDIA_L4\",\n",
    "       accelerator_count=2,\n",
    "       deploy_request_timeout=1800,\n",
    "   )\n",
    "   return model, deployed_model, endpoint\n",
    "\n",
    "model, deployed_model, endpoint = deploy_model(model_id=MODEL_ID)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import base64\n",
    "import io\n",
    "import requests\n",
    "\n",
    "\n",
    "from PIL import Image\n",
    "\n",
    "\n",
    "def get_remote_image(path):\n",
    "   \"\"\"Downloads a remote image into memory.\"\"\"\n",
    "   response = requests.get(path)\n",
    "   image = Image.open(io.BytesIO(response.content))\n",
    "   return image\n",
    "\n",
    "\n",
    "def image_to_base64(image: Image.Image) -> str:\n",
    "   \"\"\"Converts a PIL image to a base64 string.\"\"\"\n",
    "   buffer = io.BytesIO()\n",
    "   image.save(buffer, format=\"JPEG\")\n",
    "   image_str = base64.b64encode(buffer.getvalue()).decode(\"utf-8\")\n",
    "   return image_str\n",
    "\n",
    "\n",
    "prompt = 'What is in this image?'\n",
    "image_path = 'https://<anon>/test.png'\n",
    "\n",
    "\n",
    "image = get_remote_image(image_path)\n",
    "instances = [\n",
    "     {\n",
    "         'prompt': prompt,\n",
    "         'base64_image': image_to_base64(image)\n",
    "     }\n",
    "]\n",
    "preds = endpoint.predict(instances=instances).predictions\n",
    "print(preds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Producing the following `output`:\n",
    "\n",
    "```\n",
    "Prediction(predictions=['space suit'], deployed_model_id='6484700777808920576', metadata=None, model_version_id='1', model_resource_name='projects/20178026/locations/us-central1/models/1030014796319162368', explanations=None)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict_answers(model, images, questions, gen_config):\n",
    "    assert len(images) == len(questions)\n",
    "\n",
    "    instances = []\n",
    "\n",
    "    #generation_config = {\n",
    "    #    \"max_new_tokens\": 256,\n",
    "    #    \"do_sample\": True,\n",
    "    #    \"top_p\": 0.2,\n",
    "    #    \"temperature\": 0.2,\n",
    "    #}\n",
    "\n",
    "    binarized_images = []\n",
    "    for i, image in enumerate(images):\n",
    "        img = f\"data:image/png;base64,{image}\"\n",
    "        instance = {\"inputs\":f\"![]({img}){questions[i]}\\n\\n\",\n",
    "                \"parameters\": gen_config}\n",
    "        instances.append(instance)\n",
    "    print(len(instances[0][\"inputs\"]))\n",
    "    output = model.predict(\n",
    "        instances=instances\n",
    "    )\n",
    "    return output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "image_path = \"/home/<anon>/Downloads/rabbit.png\"\n",
    "\n",
    "generation_config = {\"max_new_tokens\": 100, \"do_sample\": False}\n",
    "\n",
    "with open(image_path, \"rb\") as f:\n",
    "    image = base64.b64encode(f.read()).decode(\"utf-8\")\n",
    "    images = [image, image]\n",
    "    output = predict_answers(deployed_model, \n",
    "                             images, \n",
    "                             [\"What is in the image?\", \"Is the animal in the image threatening?\"], \n",
    "                             generation_config)\n",
    "for prediction in output.predictions:\n",
    "    print(prediction)\n",
    "print(output)\n",
    "#> space suit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation task\n",
    "\n",
    "Based on [https://huggingface.co/docs/google-cloud/main/examples/vertex-ai-notebooks-evaluate-llms-with-vertex-ai](https://huggingface.co/docs/google-cloud/main/examples/vertex-ai-notebooks-evaluate-llms-with-vertex-ai)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "dataset = load_dataset(\"fgqa_hs\", split='test[:100]')\n",
    "\n",
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "dataset[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#dataset = dataset.select(range(100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We must convert to a pandas dataset in order to use the Vertex Evaluation API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = dataset.to_pandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def binarize_image(img_path, img_type='jpg'):\n",
    "    \n",
    "    with open(f\"{img_path}\", \"rb\") as f:\n",
    "        image = base64.b64encode(f.read()).decode(\"utf-8\")\n",
    "    return f\"data:image/{img_type};base64,{image}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Step 0. Convert images to base64 representations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from PIL import Image\n",
    "df[\"base64_image\"] = df[\"image\"].apply(lambda x: image_to_base64(Image.open(x['path'])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#df['prompt'] = df.apply(lambda row: f\"![]({row['img']}) {row['question']}\", axis=1)\n",
    "df['prompt'] = df.apply(lambda row: (row['base64_image'], f\"Answer with a single word or concept. {row['question']}\"), axis=1)\n",
    "\n",
    "df['prompt'][1][1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['reference'] = df['answer']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Drop all columns that we do not need for the prediction task"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate(prompt, generation_config=generation_config):\n",
    "\n",
    "    payload = prompt_to_payload(prompt, generation_config)\n",
    "    output = endpoint.predict(instances=[payload])\n",
    "    generated_text = output.predictions[0]\n",
    "    #print(output.predictions)\n",
    "    return generated_text.lower()\n",
    "\n",
    "def prompt_to_payload(prompt, generation_config):\n",
    "    return {\"base64_image\": prompt[0],\"prompt\": prompt[1]}\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt_data = []\n",
    "for row in df['prompt']:\n",
    "    prompt_data.append(prompt_to_payload(row,{}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "with open(\"data.json\", 'w') as f:\n",
    "    for item in prompt_data:\n",
    "        f.write(json.dumps(item) + \"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "generate(df['prompt'][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from vertexai.evaluation import EvalTask\n",
    "from vertexai.generative_models import (Part)\n",
    "# 2. create eval task\n",
    "eval_task = EvalTask(\n",
    "        dataset=df,\n",
    "        metrics=[\"exact_match\"],\n",
    "        experiment=\"multimodal-hypernym-semantics\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import uuid\n",
    "# 3. run eval task\n",
    "# Note: If the last iteration takes > 1 minute you might need to retry the evaluation\n",
    "exp_results = eval_task.evaluate(\n",
    "        model=generate, experiment_run_name=f\"test-gqa-{str(uuid.uuid4())[:8]}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "def predict(prompts, generation_config=generation_config):\n",
    "    payloads = [prompt_to_payload(prompt, generation_config) for prompt in prompts]\n",
    "    print(payloads)\n",
    "    output = endpoint.predict(instances=payloads)\n",
    "    generated_texts = output.predictions\n",
    "    #print(output.predictions)\n",
    "    return [pred.lower() for pred in generated_texts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "\n",
    "from tqdm import tqdm\n",
    "predictions = []\n",
    "for index, row in tqdm(df.iterrows(), total=len(df['question'])):\n",
    "    pred = generate([row['base64_image'], row['question']])        \n",
    "    predictions.append(pred)\n",
    "df['online_responses'] = predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['online_responses']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = {}\n",
    "print(exp_results.summary_metrics)\n",
    "print(f\"{exp_results.summary_metrics['exact_match/mean']}\")\n",
    "results[\"test\"] = exp_results.summary_metrics[\"exact_match/mean\"]\n",
    "\n",
    "for prompt_name, score in sorted(results.items(), key=lambda x: x[1], reverse=True):\n",
    "    print(f\"{prompt_name}: {score}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "import http.client\n",
    "import typing\n",
    "import urllib.request\n",
    "\n",
    "import IPython.display\n",
    "from PIL import Image as PIL_Image\n",
    "from PIL import ImageOps as PIL_ImageOps\n",
    "\n",
    "\n",
    "def display_images(\n",
    "    images: typing.Iterable[Image],\n",
    "    max_width: int = 600,\n",
    "    max_height: int = 350,\n",
    ") -> None:\n",
    "    for image in images:\n",
    "        pil_image = typing.cast(PIL_Image.Image, image._pil_image)\n",
    "        if pil_image.mode != \"RGB\":\n",
    "            # RGB is supported by all Jupyter environments (e.g. RGBA is not yet)\n",
    "            pil_image = pil_image.convert(\"RGB\")\n",
    "        image_width, image_height = pil_image.size\n",
    "        if max_width < image_width or max_height < image_height:\n",
    "            # Resize to display a smaller notebook image\n",
    "            pil_image = PIL_ImageOps.contain(pil_image, (max_width, max_height))\n",
    "        IPython.display.display(pil_image)\n",
    "\n",
    "\n",
    "def get_image_bytes_from_url(image_url: str) -> bytes:\n",
    "    with urllib.request.urlopen(image_url) as response:\n",
    "        response = typing.cast(http.client.HTTPResponse, response)\n",
    "        image_bytes = response.read()\n",
    "    return image_bytes\n",
    "\n",
    "\n",
    "def load_image_from_url(image_url: str) -> Image:\n",
    "    image_bytes = get_image_bytes_from_url(image_url)\n",
    "    return Image.from_bytes(image_bytes)\n",
    "\n",
    "\n",
    "def get_url_from_gcs(gcs_uri: str) -> str:\n",
    "    # converts GCS uri to url for image display.\n",
    "    url = \"https://storage.googleapis.com/\" + gcs_uri.replace(\"gs://\", \"\").replace(\n",
    "        \" \", \"%20\"\n",
    "    )\n",
    "    return url\n",
    "\n",
    "\n",
    "def print_multimodal_prompt(contents: list):\n",
    "    \"\"\"\n",
    "    Given contents that would be sent to Gemini,\n",
    "    output the full multimodal prompt for ease of readability.\n",
    "    \"\"\"\n",
    "    for content in contents:\n",
    "        if isinstance(content, Image):\n",
    "            display_images([content])\n",
    "        elif isinstance(content, Part):\n",
    "            url = get_url_from_gcs(content.file_data.file_uri)\n",
    "            IPython.display.display(load_image_from_url(url))\n",
    "        else:\n",
    "            print(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exp_results.metrics_table[['question', 'response', 'reference', 'argument', 'substitution', 'exact_match/score']]\n",
    "#exp_results.metrics_table[['question', 'response', 'reference', 'argument', 'substitution', 'exact_match/score', 'rouge/score']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_df = exp_results.metrics_table[['question', 'response', 'reference', 'argument', 'substitution', 'exact_match/score']]\n",
    "#result_df = exp_results.metrics_table[['question', 'response', 'reference', 'argument', 'substitution', 'exact_match/score', 'rouge/score']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_df.to_csv('llava-100-test.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregated_base_questions = result_df[result_df['substitution'] == ''].groupby('argument').agg({\n",
    "    'exact_match/score': 'mean',\n",
    "  #  'rouge/score': 'mean'\n",
    "}).reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregated_base_questions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Aggregating over all substitutions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregated_substitutions = result_df[result_df['substitution'] != ''].groupby('argument').agg({\n",
    "    'exact_match/score': 'mean',\n",
    "#    'rouge/score': 'mean'\n",
    "}).reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "aggregated_substitutions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "aggregated_combined = aggregated_base_questions.rename(columns={'exact_match/score': 'base/exact_match/score',\n",
    "                                                                #'rouge/score':'base/rouge/score'\n",
    "                                                               })\n",
    "\n",
    "# Merge the two dataframes on a common key (in this case, 'key')\n",
    "aggregated_combined = pd.merge(aggregated_combined, aggregated_substitutions, on='argument', how='left')\n",
    "\n",
    "# Fill empty values with 0.0\n",
    "aggregated_combined = aggregated_combined.fillna(0.0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregated_combined"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(exp_results.metrics_table['response'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import Image\n",
    "\n",
    "from PIL import Image as PImage\n",
    "\n",
    "imgs = set([img['path'] for img in df['image'][:10]])\n",
    "\n",
    "for image in imgs:\n",
    "    #Image(filename=image['path'])\n",
    "    img = mpimg.imread(image)\n",
    "    plt.imshow(img)\n",
    "    plt.show()\n",
    "    img = image_to_base64(PImage.open(image))\n",
    "    output = generate([img, \"What is in the image?\"])\n",
    "    print(output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.image as mpimg\n",
    "image = mpimg.imread(image['path'])\n",
    "plt.imshow(image)\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Batch inference\n",
    "\n",
    "Below is example code from the GCP documentation found at (https://cloud.google.com/vertex-ai/docs/predictions/get-batch-predictions)[https://cloud.google.com/vertex-ai/docs/predictions/get-batch-predictions]\n",
    "\n",
    "Also check (https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/batch_eval_llm.ipynb)[https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/batch_eval_llm.ipynb]\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_batch_prediction_job_dedicated_resources_sample(\n",
    "    model,\n",
    "    job_display_name: str,\n",
    "    gcs_source,\n",
    "    gcs_destination: str,\n",
    "    machine_type=\"g2-standard-24\", #$0.8129 USD / hour\n",
    "    accelerator_type=\"NVIDIA_L4\", #$0.644046 USD / hour\n",
    "    accelerator_count=2,\n",
    "    instances_format: str = \"jsonl\",\n",
    "    starting_replica_count: int = 1,\n",
    "    max_replica_count: int = 1,\n",
    "    sync: bool = True,\n",
    "):\n",
    "\n",
    "\n",
    "    batch_prediction_job = model.batch_predict(\n",
    "        job_display_name=job_display_name,\n",
    "        gcs_source=gcs_source,\n",
    "        gcs_destination_prefix=gcs_destination,\n",
    "        instances_format=instances_format,\n",
    "        starting_replica_count=starting_replica_count,\n",
    "        max_replica_count=max_replica_count,\n",
    "        \n",
    "        machine_type=machine_type,\n",
    "        accelerator_type=accelerator_type,\n",
    "        accelerator_count=accelerator_count,\n",
    "        sync=sync,\n",
    "    )\n",
    "\n",
    "    batch_prediction_job.wait()\n",
    "\n",
    "    print(batch_prediction_job.display_name)\n",
    "    print(batch_prediction_job.resource_name)\n",
    "    print(batch_prediction_job.state)\n",
    "    return batch_prediction_job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_prediction_job = create_batch_prediction_job_dedicated_resources_sample(\n",
    "    model,\n",
    "        job_display_name=\"batch-llava-test-100\",\n",
    "        gcs_source=\"gs://multimodal-representations-eval-data/data.jsonl\",\n",
    "        gcs_destination=\"gs://multimodal-representations-eval-data/\",\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Resource clean-up (DEFINITELY DO THIS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, you can already release the resources that you've created as follows, to avoid unnecessary costs:\n",
    "\n",
    "* `deployed_model.undeploy_all` to undeploy the model from all the endpoints.\n",
    "* `deployed_model.delete` to delete the endpoint/s where the model was deployed gracefully, after the `undeploy_all` method.\n",
    "* `model.delete` to delete the model from the registry."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "deployed_model.undeploy_all()\n",
    "deployed_model.delete()\n",
    "model.delete()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you can also remove those from the Google Cloud Console following the steps:\n",
    "\n",
    "* Go to Vertex AI in Google Cloud\n",
    "* Go to Deploy and use -> Online prediction\n",
    "* Click on the endpoint and then on the deployed model/s to \"Undeploy model from endpoint\"\n",
    "* Then go back to the endpoint list and remove the endpoint\n",
    "* Finally, go to Deploy and use -> Model Registry, and remove the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Disable APIs\n",
    "\n",
    "!gcloud services disable aiplatform.googleapis.com\n",
    "!gcloud services disable compute.googleapis.com\n",
    "!gcloud services disable container.googleapis.com\n",
    "!gcloud services disable containerregistry.googleapis.com\n",
    "!gcloud services disable containerfilesystem.googleapis.com"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### PLEASE ALSO MANUALLY ENSURE ALL APIS ARE DISABLED ON GCP AFTER THIS IS DONE!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download an image from Google Cloud Storage\n",
    "# Load from local file\n",
    "from vertexai.generative_models import Image as V_Image\n",
    "\n",
    "gen_model = GenerativeModel(\"paligemma\")\n",
    "\n",
    "image = V_Image.load_from_file(df['image'][0]['path'])\n",
    "\n",
    "# Prepare contents\n",
    "prompt = \"Describe this image?\"\n",
    "contents = [image, prompt]\n",
    "\n",
    "response = gen_model.generate_content(contents)\n",
    "\n",
    "print(\"-------Prompt--------\")\n",
    "print_multimodal_prompt(contents)\n",
    "\n",
    "print(\"\\n-------Response--------\")\n",
    "print(response.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "endpoints = aiplatform.Endpoint.list()\n",
    "for i in endpoints:\n",
    "        i.undeploy_all()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Alternatives for PaliGemma\n",
    "\n",
    "https://ai.google.dev/gemma/docs/paligemma/inference-with-keras"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
