{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n"
     ]
    }
   ],
   "source": [
    "import cv2\n",
    "import numpy as np\n",
    "import base64\n",
    "import requests\n",
    "import ai2thor.controller\n",
    "\n",
    "controller = ai2thor.controller.Controller(width=800, height=800, scene=\"FloorPlan1\")\n",
    "\n",
    "# agentCount specifies the number of agents in a scene\n",
    "# event = controller.step(dict(action='Initialize', gridSize=0.25, agentCount=2))\n",
    "event = controller.step(\n",
    "    dict(action=\"Initialize\", gridSize=0.25, renderObjectImage=True, agentCount=1)\n",
    ")\n",
    "\n",
    "# print out agentIds\n",
    "for e in event.events:\n",
    "    print(e.metadata[\"agentId\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "event = controller.step(\n",
    "    dict(\n",
    "        action=\"Teleport\",\n",
    "        position=dict(x=1.5, y=0.9, z=-1.5),\n",
    "        rotation=dict(x=0, y=270, z=0),\n",
    "        agentId=0,\n",
    "    )\n",
    ")\n",
    "event = controller.step(dict(action=\"MoveAhead\", agentId=0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cv2_img = event.cv2img\n",
    "cv2.imwrite(\"test.png\", cv2_img)"
   ]
  },
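  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode_image(image_path):\n",
    "    \"\"\"Return the file at image_path as a base64-encoded UTF-8 string.\"\"\"\n",
    "    with open(image_path, \"rb\") as image_file:\n",
    "        return base64.b64encode(image_file.read()).decode(\"utf-8\")\n",
    "\n",
    "\n",
    "image_path = \"test.png\"\n",
    "# open() does not expand \"~\", so a home-relative path needs os.path.expanduser():\n",
    "# image_path = \"~/Downloads/SayCanPlayground/AI2Thor/multimodal/test.jpg\""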
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Read the key from the environment instead of hard-coding a secret in the notebook.\n",
    "api_key = os.environ[\"OPENAI_API_KEY\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "You are an excellent image summariser who is tasked with helping 2 robots: Alice and Bob. Each robot has a partially observable view of the environment. Hence they have to explore around in the environment to do the task.\n",
      "Both robots can perform the following tasks: [\"Move(direction)\", \"Rotate(Rotation)\", \"PickupObject(object_id)\", \"PutObject(receptacle_id)\", \"OpenObject(object_id)\", \"CloseObject(object_id)\"]\n",
      "\"Move(direction)\" makes the robot move in that direction. \n",
      "\"Rotate(rotation)\" makes the robot rotate by 90 degrees towards that side.\n",
      "\"PickupObject(object_id)\" allows the robot to pickup object:<object_id>.\n",
      "\"PutObject(receptacle_id)\" allows the robot to place object that has been already picked up on receptacle:<receptacle_id>.\n",
      "\"OpenObject(object_id)\" allows the robot to open the object:<object_id>.\n",
      "\"CloseObject(object_id)\" allows the robot to close the object:<object_id>.\n",
      "Here <direction> can be one of [\"Ahead\", \"Back\", \"Left\", \"Right\"], and <rotation> can be one of [\"Right\", \"Left\"].\n",
      "You are supposed to give the next step Alice and Bob are supposed to take such that they complete the task more efficiently as compared to if only one of them was doing the task.\n",
      "Example output: ['Move(Ahead)', 'Rotate(Right)']\n",
      "\n",
      "You will get a description of the task Alice and Bob are supposed to do. You will get images of the environment from the robot's perspective as the observation input. To help you with detecting objects in the image, you will also get a list objects each agent is able to see in the environment. Here the objects are named as \"<object_name>_<object_id>\". \n",
      "So, along with the image inputs you will get the following information:\n",
      "Input format: {Task: description of the task Alice and Bob are supposed to do,\n",
      "Alice's observation list: list of objects that Alice is observing,\n",
      "Bob's observation list: list of objects that Alice is observing,\n",
      "Task Success: Whether the task is done or not}\n",
      "\n",
      "### INPUT FORMAT ###\n",
      "{Task: description of the task Alice and Bob are supposed to do, \n",
      "Alice's observation: list of objects that Alice is observing,\n",
      "Alice's state: description of Alice's state,\n",
      "Alice's previous observation: description of what Alice saw in the previous time step,\n",
      "Alice's previous action: description of what Alice did in the previous time step and whether it was successful,\n",
      "Alice's previous failures: if Alice's few previous actions failed, description of what failed,\n",
      "Bob's observation: list of objects that Bob is observing,\n",
      "Bob's state: description of Bob's state,\n",
      "Bob's previous observation: description of what Alice saw in the previous time step,\n",
      "Bob's previous action: description of what Alice did in the previous time step and whether it was successful,\n",
      "Bob's previous failures: if Bob's few previous actions failed, description of what failed,\n",
      "Task Success: Whether the task is done or not}.\n",
      "\n",
      "Remember that <object_id> and <receptacle_id> can only be from what the agent sees in \"<agent_name>'s Observation\". Here <agent_name> is either Alice or Bob.  The robots can hold only one object at a time.\n",
      "\n",
      "The robots can hold only one object at a time.\n",
      "Even if the agents can see objects, they might not be able to interact with them if they are too far away. Hence you will need to make the agents move closer to the objects they want to interact with.\n",
      "For example: An action like \"PickupObject(object_id)\" is feasible only if the robot can see the object and is close enough to it. So you will have to move closer to it before you can pick it up.\n",
      "If you are too close to an object, you might not be able to move in a particular direction. \n",
      "For example: If you are too close to a wall, you might not be able to move \"Ahead\" or \"Back\".\n",
      "So if a particular action fails, you will have to choose a different action for that robot. Also, if you open an object, please ensure that you close it before you move to a different place.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from prompt import PROMPT\n",
    "print(PROMPT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "base64_image = encode_image(image_path)\n",
    "\n",
    "headers = {\"Content-Type\": \"application/json\", \"Authorization\": f\"Bearer {api_key}\"}\n",
    "\n",
    "payload = {\n",
    "    \"model\": \"gpt-4-vision-preview\",\n",
    "    \"messages\": [\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\"type\": \"text\", \"text\": \"What’s in this image?\"},\n",
    "                {\n",
    "                    \"type\": \"image_url\",\n",
    "                    # test.png is a PNG file, so label the data URL as image/png.\n",
    "                    \"image_url\": {\"url\": f\"data:image/png;base64,{base64_image}\"},\n",
    "                },\n",
    "            ],\n",
    "        }\n",
    "    ],\n",
    "    \"max_tokens\": 300,\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROMPT = \"\"\"You are an excellent planner tasked with helping a robot carry out a task. The robot has only a partial view of the environment, so it has to explore the environment to do the task.\n",
    "The robot can perform the following actions: [\"Move(direction)\", \"Rotate(rotation)\", \"PickupObject(object_id)\", \"PutObject(receptacle_id)\", \"OpenObject(object_id)\", \"CloseObject(object_id)\"]\n",
    "\"Move(direction)\" makes the robot move in that direction.\n",
    "\"Rotate(rotation)\" makes the robot rotate by 90 degrees towards that side.\n",
    "\"PickupObject(object_id)\" allows the robot to pick up object:<object_id>.\n",
    "\"PutObject(receptacle_id)\" allows the robot to place the object it has already picked up on receptacle:<receptacle_id>.\n",
    "\"OpenObject(object_id)\" allows the robot to open the object:<object_id>.\n",
    "\"CloseObject(object_id)\" allows the robot to close the object:<object_id>.\n",
    "Here <direction> can be one of [\"Ahead\", \"Back\", \"Left\", \"Right\"], and <rotation> can be one of [\"Right\", \"Left\"].\n",
    "You are supposed to give the next step the robot should take so that it completes the task.\n",
    "\n",
    "You will get a description of the task the robot is supposed to do. You will get an image of the environment from the robot's perspective as the observation input. To help you detect objects in the image, you will also get a list of objects the robot is able to see in the environment. Here the objects are named as \"<object_name>_<object_id>\".\n",
    "So, along with the image input you will get the following information:\n",
    "### INPUT FORMAT ###\n",
    "{Task: description of the task the robot is supposed to do,\n",
    "robot's observation: list of objects the robot is observing,\n",
    "robot's state: description of the robot's state,\n",
    "robot's previous observation: description of what the robot saw in the previous time step,\n",
    "robot's previous action: description of what the robot did in the previous time step and whether it was successful,\n",
    "robot's previous failures: if the robot's last few actions failed, description of what failed,\n",
    "Task Success: whether the task is done or not}.\n",
    "\n",
    "Remember that <object_id> and <receptacle_id> can only be from what the robot sees in \"robot's observation\".\n",
    "\n",
    "The robot can hold only one object at a time.\n",
    "Even if the robot can see objects, it might not be able to interact with them if they are too far away. Hence you will need to make the robot move closer to the objects it wants to interact with.\n",
    "For example: An action like \"PickupObject(object_id)\" is feasible only if the robot can see the object and is close enough to it. So the robot will have to move closer to the object before it can pick it up.\n",
    "If the robot is too close to an object, it might not be able to move in a particular direction.\n",
    "For example: If the robot is too close to a wall, it might not be able to move \"Ahead\" or \"Back\".\n",
    "So if a particular action fails, you will have to choose a different action for the robot. Also, if the robot opens an object, please ensure that it closes the object before it moves to a different place.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROMPT = \"\"\"You are an excellent planner tasked with helping a robot carry out a task. The robot has only a partial view of the environment, so it has to explore the environment to do the task.\n",
    "The robot can perform the following actions: [\"Move(direction)\", \"Rotate(rotation)\", \"PickupObject(object_id)\", \"PutObject(receptacle_id)\", \"OpenObject(object_id)\", \"CloseObject(object_id)\"]\n",
    "\"Move(direction)\" makes the robot move in that direction.\n",
    "\"Rotate(rotation)\" makes the robot rotate by 90 degrees towards that side.\n",
    "\"PickupObject(object_id)\" allows the robot to pick up object:<object_id>.\n",
    "\"PutObject(receptacle_id)\" allows the robot to place the object it has already picked up on receptacle:<receptacle_id>.\n",
    "\"OpenObject(object_id)\" allows the robot to open the object:<object_id>.\n",
    "\"CloseObject(object_id)\" allows the robot to close the object:<object_id>.\n",
    "Here <direction> can be one of [\"Ahead\", \"Back\", \"Left\", \"Right\"], and <rotation> can be one of [\"Right\", \"Left\"].\n",
    "You are supposed to give the next step the robot should take so that it completes the task.\n",
    "\n",
    "You will get a description of the task the robot is supposed to do. You will get an image of the environment from the robot's perspective as the observation input. To help you detect objects in the image, you will also get a list of objects the robot is able to see in the environment. Here the objects are named as \"<object_name>_<object_id>\".\n",
    "So, along with the image input you will get the following information:\n",
    "### INPUT FORMAT ###\n",
    "{Task: description of the task the robot is supposed to do,\n",
    "robot's observation: list of objects the robot is observing,\n",
    "robot's state: description of the robot's state,\n",
    "robot's previous observation: description of what the robot saw in the previous time step,\n",
    "robot's previous action: description of what the robot did in the previous time step and whether it was successful,\n",
    "robot's previous failures: if the robot's last few actions failed, description of what failed}\n",
    "\n",
    "Remember that <object_id> and <receptacle_id> can only be from what the robot sees in \"robot's observation\".\n",
    "\n",
    "The robot can hold only one object at a time.\n",
    "Even if the robot can see objects, it might not be able to interact with them if they are too far away. Hence you will need to make the robot move closer to the objects it wants to interact with.\n",
    "For example: An action like \"PickupObject(object_id)\" is feasible only if the robot can see the object and is close enough to it. So the robot will have to move closer to the object before it can pick it up.\n",
    "If the robot is too close to an object, it might not be able to move in a particular direction.\n",
    "For example: If the robot is too close to a wall, it might not be able to move \"Ahead\" or \"Back\".\n",
    "So if a particular action fails, you will have to choose a different action for the robot. Also, if the robot opens an object, please ensure that it closes the object before it moves to a different place.\n",
    "\"\"\""
   ]
  },
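  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch (an assumption, not part of the original run): one way to build the\n",
    "# user_prompt string that the request payload expects, by filling in the fields\n",
    "# listed under \"### INPUT FORMAT ###\" in PROMPT. The helper name, the task text,\n",
    "# and the state/action strings below are hypothetical placeholders.\n",
    "def build_user_prompt(task, observation, state, prev_observation, prev_action, prev_failures):\n",
    "    fields = [\n",
    "        f\"Task: {task}\",\n",
    "        f\"robot's observation: {observation}\",\n",
    "        f\"robot's state: {state}\",\n",
    "        f\"robot's previous observation: {prev_observation}\",\n",
    "        f\"robot's previous action: {prev_action}\",\n",
    "        f\"robot's previous failures: {prev_failures}\",\n",
    "    ]\n",
    "    return \"{\" + \",\\n\".join(fields) + \"}\"\n",
    "\n",
    "\n",
    "# AI2-THOR reports per-object metadata, including a \"visible\" flag and an \"objectId\".\n",
    "# Converting these ids to the \"<object_name>_<object_id>\" naming used in PROMPT is\n",
    "# not shown here.\n",
    "visible_objects = [obj[\"objectId\"] for obj in event.metadata[\"objects\"] if obj[\"visible\"]]\n",
    "\n",
    "user_prompt = build_user_prompt(\n",
    "    task=\"put the apple in the fridge\",  # hypothetical task\n",
    "    observation=visible_objects,\n",
    "    state=\"not holding anything\",  # hypothetical\n",
    "    prev_observation=[],\n",
    "    prev_action=\"Move(Ahead), successful\",  # hypothetical\n",
    "    prev_failures=\"None\",\n",
    ")"
   ]
  },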
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "base64_image = encode_image(image_path)\n",
    "\n",
    "headers = {\"Content-Type\": \"application/json\", \"Authorization\": f\"Bearer {api_key}\"}\n",
    "\n",
    "# user_prompt should hold the task and observation text, in the INPUT FORMAT that PROMPT describes.\n",
    "payload = {\n",
    "    \"model\": \"gpt-4-vision-preview\",\n",
    "    \"messages\": [\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": [\n",
    "                {\"type\": \"text\", \"text\": PROMPT},\n",
    "            ],\n",
    "        },\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [\n",
    "                {\"type\": \"text\", \"text\": user_prompt},\n",
    "                {\n",
    "                    \"type\": \"image_url\",\n",
    "                    # test.png is a PNG file, so label the data URL as image/png.\n",
    "                    \"image_url\": {\"url\": f\"data:image/png;base64,{base64_image}\"},\n",
    "                },\n",
    "            ],\n",
    "        }\n",
    "    ],\n",
    "    \"max_tokens\": 300,\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = requests.post(\n",
    "    \"https://api.openai.com/v1/chat/completions\", headers=headers, json=payload\n",
    ")"
   ]
  },
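  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch (not part of the original run): check the HTTP status before parsing the\n",
    "# body, so that an authentication or quota error surfaces here as an exception\n",
    "# rather than later as a KeyError on \"choices\".\n",
    "print(response.status_code)\n",
    "response.raise_for_status()"
   ]
  },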
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "response_dict = response.json()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This image shows a modern kitchen interior. There is a range of kitchen appliances such as an electric stove and an over-the-range microwave. A refrigerator can be seen in the corner, and there's a sink with a faucet in the countertop. The kitchen features white cabinetry with metallic handles, and there seems to be a subway tile backsplash. On the countertops, there are a few items like a toaster, a bowl, and some fresh produce (an apple and a lime).\n",
      "\n",
      "A decorative pendant light is hanging above the kitchen island, and there is a larger light fixture in the room as well. The flooring appears to be hardwood, and the countertop looks like it's made of marble or similar material. The kitchen also has a significant skylight letting in natural light from above, and there's a window with a view of trees and sunlight, suggesting a pleasant, sunny day outside.\n"
     ]
    }
   ],
   "source": [
    "print(response_dict[\"choices\"][0][\"message\"][\"content\"])"
   ]
  },
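  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "\n",
    "# Sketch (an assumption, not part of the original run): turn an action string in the\n",
    "# format PROMPT asks for, e.g. \"Move(Ahead)\" or \"PickupObject(Apple_1)\", into a dict\n",
    "# for controller.step. The helper name is hypothetical. Mapping the prompt's\n",
    "# \"<object_name>_<object_id>\" names back to AI2-THOR objectIds is left to the caller,\n",
    "# and the parameter name PutObject expects may differ between AI2-THOR versions.\n",
    "def to_thor_action(action_str, agent_id=0):\n",
    "    match = re.fullmatch(r\"\\s*(\\w+)\\((.*)\\)\\s*\", action_str)\n",
    "    if match is None:\n",
    "        raise ValueError(f\"Unrecognised action: {action_str!r}\")\n",
    "    name, arg = match.group(1), match.group(2).strip()\n",
    "    if name in (\"Move\", \"Rotate\"):\n",
    "        # Move(Ahead) -> MoveAhead, Rotate(Left) -> RotateLeft\n",
    "        return dict(action=name + arg, agentId=agent_id)\n",
    "    return dict(action=name, objectId=arg, agentId=agent_id)\n",
    "\n",
    "\n",
    "to_thor_action(\"Move(Ahead)\")"
   ]
  },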
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py310",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
