
from typing import Dict
import sys
PROMPT_PREFIX = """You are an agent who can operate an Android phone on behalf of a user.
Based on user's goal/request, you may complete some tasks described in the requests/goals by performing actions (step by step) on the phone.

When given a user request, you will try to complete it step by step. At each step, you will be given the current screenshot and a history of what you have done (in text). Based on this information and the goal, you must choose one of the actions in the following list (each action description is followed by its JSON format) and output it in the correct JSON format.
- Click/tap on an element on the screen, describe the element you want to operate with: `{"action_type": "click", "element": <target_element_description>}`
- Long press on an element on the screen, similar to the click action above: `{"action_type": "long_press", "element": <target_element_description>}`
- Type text into a text field: `{"action_type": "type_text", "text": <text_input>, "element": <target_element_description>}`
- Scroll the screen in one of the four directions: `{"action_type": "scroll", "direction": <up, down, left, right>}`
- Navigate to the home screen: `{"action_type": "navigate_home"}`
- Navigate back: `{"action_type": "navigate_back"}`
- Open an app (nothing will happen if the app is not installed): `{"action_type": "open_app", "app_name": <name>}`
- Wait for the screen to update: `{"action_type": "wait"}`
"""

GUIDANCE = """Here are some useful guidelines you need to follow:
General:
- There are usually multiple ways to complete a task; pick the easiest one. When something does not work as expected (for whatever reason), a simple retry can sometimes solve the problem, but if it does not (you can tell from the history), SWITCH to another solution.
- If the desired state is already achieved (e.g., enabling Wi-Fi when it's already on), you can just complete the task.

Action Related:
- Use the `open_app` action whenever you want to open an app (nothing will happen if the app is not installed), do not use the app drawer to open an app unless all other ways have failed.
- Use the `type_text` action whenever you want to type something (including passwords) instead of clicking characters on the keyboard one by one. If the text field already contains default text, remember to delete it before typing.
- For `click`, `long_press` and `type_text`, the element you pick must be VISIBLE in the screenshot to interact with it.
- The 'element' field requires a concise yet comprehensive description of the target element in a single sentence, not exceeding 30 words. Include all essential information to uniquely identify the element. If you find identical elements, specify their location and details to differentiate them from others.
- Consider exploring the screen by using the `scroll` action with different directions to reveal additional content.
- The direction parameter of the `scroll` action is the direction in which you want the view to move, which is opposite to the swipe gesture; for example, to view content further down the screen, set the `scroll` direction to `down`.

Text Related Operations:
- Normally, to select certain text on the screen: <i> Enter text selection mode by long pressing the area where the text is; some of the words near the long press point will be selected (highlighted, with two pointers marking the range), and usually a text selection bar will also appear with options like `copy`, `paste`, `select all`, etc. <ii> Select the exact text you need. Usually the text selected in the previous step is NOT the text you want, so you need to adjust the range by dragging the two pointers. If you want to select all text in the text field, simply click the `select all` button in the bar.
- At this point, you cannot drag anything around the screen, so in general you cannot select arbitrary text.
- To delete some text: the most traditional way is to place the cursor at the right position and use the backspace key on the keyboard to delete the characters one by one (you can long press backspace to speed this up if there are many characters to delete). Another approach is to first select the text you want to delete, then click the backspace key on the keyboard.
- To copy some text: first select the exact text you want to copy, which usually also brings up the text selection bar, then click the `copy` button in the bar.
- To paste text into a text box, first long press the text box; usually the text selection bar will then appear with a `paste` button in it.
- When typing into a text field, sometimes an auto-complete dropdown list will appear. This usually indicates that the field is an enum field, and you should try to select the best match by clicking the corresponding item in the list.
"""

OUTPUT_FORMAT="""
At each step, you will be given the current screenshot and a history of what you have done (in text). Based on this information and the goal, you must choose one of the actions in the following list (each action description is followed by its JSON format) and output it in the correct JSON format.
- Click/tap on a point on the screen, given by its coordinates: `{"action_type": "click", "x":<x_coordinate>, "y":<y_coordinate>}`
- Long press on a point on the screen, similar to the click action above: `{"action_type": "long_press","x":<x_coordinate>, "y":<y_coordinate>}`
- Type text into a text field: `{"action_type": "input_text", "text": <text_input>}`
- Scroll the screen in one of the four directions: `{"action_type": "scroll", "direction": <up, down, left, right>}`
- Navigate to the home screen: `{"action_type": "navigate_home"}`
- Navigate back: `{"action_type": "navigate_back"}`
- Open an app (nothing will happen if the app is not installed): `{"action_type": "open_app", "app_name": <name>}`
- Wait for the screen to update: `{"action_type": "wait"}`
"""


TEMPLATES: Dict[str, list] = {}  # model name -> [system prompt, user prompt, output format, chat template]

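def get_register_template(model_name):
    """Return the [system, user, output_format, chat_template] list registered for `model_name`."""
    if model_name not in TEMPLATES:
        sys.exit(f"No template registered for model {model_name!r}; available: {sorted(TEMPLATES)}")
    return TEMPLATES[model_name]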
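

# --- Hypothetical helper (a sketch, not part of the original pipeline) ---
# Each TEMPLATES entry is a list [system_prompt, user_prompt, output_format,
# chat_template]. This sketch shows how the first two entries might be turned
# into a Qwen-style message list with one screenshot, for example
# build_messages(get_register_template("QWEN25VL"), goal, history, path).
# The message layout is an assumption chosen to match the checks made by the
# chat templates below: an {"type": "image", ...} item followed by a text item.
# Joining previous actions one per line, with "None" for an empty history, is
# also an assumption.
from typing import List, Sequence


def build_messages(template: Sequence[str], overall_goal: str,
                   previous_actions: List[str], image_path: str) -> list:
    """Fill the user prompt and wrap both prompts as chat messages (sketch)."""
    system_prompt, user_prompt = template[0], template[1]
    user_text = user_prompt.format(
        overall_goal=overall_goal,
        previous_actions="\n".join(previous_actions) or "None",
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": user_text},
        ]},
    ]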


# TODO: QWEN2VL
QWEN2VL_SYS="""You are a helpful assistant.
"""

# QWEN2VL_USER="""
# Please generate the next action according to the screenshot image, instruction and previous actions.
# Instruction: {overall_goal} 
# Previous actions: {previous_actions}
# """


QWEN2VL_USER="""
You are a GUI agent. You need to perform the next action to complete the task. \n\n## Output Format\n\nThought: ...\nAction: ...\n\n\n## Action Space \nclick(coordinate='(relative_x,relative_y)')\nlong_press(coordinate='(relative_x,relative_y)')\ninput_text(content='')\nscroll(direction='down or up or right or left')\nopen_app(app_name='')\nnavigate_back()\nnavigate_home()\nwait() # Submit the task regardless of whether it succeeds or fails.\n\n## Note\n- Use English in Thought part.\n\n- Use coordinates in relative terms (from 0 to 1)\n\n- Summarize your next action (with its target element) in one sentence in Thought part.\n
Please generate the next action according to the instruction, previous actions and screenshot image. 
## Instruction: {overall_goal} 
## Previous actions: {previous_actions}
"""

qwen2vl_template = "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"


TEMPLATES['QWEN2VL']=[QWEN2VL_SYS,QWEN2VL_USER,OUTPUT_FORMAT,qwen2vl_template]


# TODO : QWEN25VL
QWEN25VL_SYS="""You are a helpful assistant."""

QWEN25VL_USER="""
Please generate the next action according to the screenshot image, instruction and previous actions.
Instruction: {overall_goal} 
Previous actions: {previous_actions}
"""
qwen25vl_template="{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"

TEMPLATES['QWEN25VL']=[QWEN25VL_SYS,QWEN25VL_USER,OUTPUT_FORMAT,qwen25vl_template]




# TODO: UI-TARS

# TODO: AGUVIS
swipe_func = {
    "name": "swipe",
    "description": "Swipe on the screen",
    "parameters": {
        "type": "object",
        "properties": {
            "from_coord": {
                "type": "array",
                "items": {"type": "number"},
                "description": "The starting coordinates of the swipe",
            },
            "to_coord": {
                "type": "array",
                "items": {"type": "number"},
                "description": "The ending coordinates of the swipe",
            },
        },
        "required": ["from_coord", "to_coord"],
    },
}

home_func = {"name": "home", "description": "Go to the home screen"}
back_func = {"name": "back", "description": "Go to the previous screen"}
recent_func = {"name": "recent", "description": "Go to the previous APP"}
complete_func = {"name": "complete", "description": "The sign that the task has been completed"}

long_press_func = {
    "name": "long_press",
    "description": "Long press on the screen",
    "parameters": {
        "type": "object",
        "properties": {
            "x": {
                "type": "number",
                "description": "The x coordinate of the long press",
            },
            "y": {
                "type": "number",
                "description": "The y coordinate of the long press",
            },
        },
        "required": ["x", "y"],
    },
}

terminate_func={ "name": "terminate", 
                "description": "Terminate the current task and report its completion status", 
                "parameters": { "type": "object", "properties": 
                    {"status": 
                        {"type": "string", "enum": ["success"], "description": "The status of the task"}}, 
                    "required": ["status"] } 
                }
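
# --- Hypothetical helper (a sketch, not part of the original pipeline) ---
# The dicts above follow a JSON-Schema-like function-calling layout. This
# sketch checks a decoded tool call against such schemas. The function must be
# known, every "required" argument must be present, and "enum" constraints
# must hold. The {"name": ..., "arguments": {...}} call shape is an assumption.
# Only these checks are performed; this is not full JSON Schema validation.
from typing import Dict, Iterable


def check_tool_call(call: dict, functions: Iterable[dict]) -> None:
    """Raise ValueError if `call` does not match one of `functions` (sketch)."""
    by_name: Dict[str, dict] = {f["name"]: f for f in functions}
    schema = by_name.get(call.get("name"))
    if schema is None:
        raise ValueError(f"unknown function: {call.get('name')!r}")
    params = schema.get("parameters", {})
    args = call.get("arguments", {})
    missing = [p for p in params.get("required", []) if p not in args]
    if missing:
        raise ValueError(f"{schema['name']} is missing arguments: {missing}")
    for key, value in args.items():
        allowed = params.get("properties", {}).get(key, {}).get("enum")
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key}={value!r} is not one of {allowed}")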


AGUVIS_SYS=f"""You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.
"""
aguvis_template="{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant<|recipient|>all\nThought:{% endif %}"

AGUVIS_USER="""
Please generate the next move according to the ui screenshot, instruction and previous actions.

Instruction: {overall_goal} 

Previous actions: {previous_actions}
"""

TEMPLATES['AGUVIS']=[AGUVIS_SYS,AGUVIS_USER,OUTPUT_FORMAT,aguvis_template]


# TODO: Qwen2vl-llamafactory
LLAMA_SYS=f"""You are a helpful assistant.
"""
LLAMA_USER="""
You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.\n <image> Please generate the next move according to the ui screenshot, instruction and previous actions. 

Instruction: {overall_goal} 

Previous actions: {previous_actions}

Please generate the natural language description of the current action in <action></action> and the specific execution action in <tool_call></tool_call>
"""
TEMPLATES['QWEN2VL_Llama']=[LLAMA_SYS,LLAMA_USER,OUTPUT_FORMAT,qwen2vl_template]



# TODO: Qwen2vl-llamafactory-format
QWEN2VL_Llama_Format_SYS=AGUVIS_SYS
QWEN2VL_Llama_Format_USER="""
Please generate the next move according to the ui screenshot, instruction and previous actions.

Instruction: {overall_goal} 

Previous actions: {previous_actions}

Please generate the observation of the current screen in <observation></observation>, the thinking of the current action in <thinking></thinking>, and the natural language description of the current action in <action></action>. Finally, the specific execution action is generated in <tool_call></tool_call>
"""
TEMPLATES['QWEN2VL_Llama_Format']=[QWEN2VL_Llama_Format_SYS,QWEN2VL_Llama_Format_USER,OUTPUT_FORMAT,qwen2vl_template]
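
# --- Hypothetical helper (a sketch, not part of the original pipeline) ---
# The llamafactory-style prompts above ask for <observation>, <thinking>,
# <action> and <tool_call> sections. This sketch pulls each tag's content out
# of a model response and returns None for any missing tag. It assumes tags are
# not nested and keeps only the first occurrence of each. Decoding the
# <tool_call> payload as JSON is an assumption about its format.
import json
import re
from typing import Dict, Optional

_RESPONSE_TAGS = ("observation", "thinking", "action", "tool_call")


def split_tagged_response(response: str) -> Dict[str, Optional[str]]:
    """Map each expected tag to its stripped content, or None (sketch)."""
    sections: Dict[str, Optional[str]] = {}
    for tag in _RESPONSE_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.S)
        sections[tag] = match.group(1).strip() if match else None
    return sections


def decode_tool_call(response: str) -> Optional[dict]:
    """Decode the <tool_call> section as JSON, if present (sketch)."""
    raw = split_tagged_response(response)["tool_call"]
    return json.loads(raw) if raw else None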
