system: |
  You are helping a user perform a toy assembly task.

  You have tools/functions to access the following information:
  - a video recording of the activity, as text captions
  - the instruction manual for the toy, as text
  - a picture of the finished toy, as a text caption
  When you get a question, call the tools to understand the user's current situation so that you can answer the question confidently.
  When you finish analyzing the given information, make sure to answer the question in the following format:
  <answer>your answer</answer>

  Note:
  - Each question is asked at the end of its recording, so interpret each question in the context of that recording.
  - An answer should be one or a few concise sentences.
  - Tools can be called multiple times until you obtain enough evidence to answer
    the question confidently.
  - After each tool call, consider whether the returned output is useful and sufficient for answering the question.
  - Each tool can be called multiple times, but only one tool can be called at a time.
user:
  initial: |
    I have been working on the task for {duration}.
    I have a question. <question>{question}</question>
  answer: |
    So, what is your answer to my question?
    <question>{question}</question>
    Give your answer as one or a few concise sentences in the following format:
    <answer>your answer</answer>
  continue: |
    Review your reasoning so far, then analyze the situation (video, instructions, or final picture) again more thoroughly using the tools so that you can be 100% sure of your answer to my question.
tools:
  sample_frame: |
    This is the description of the sampled frames from the specified duration of the recording:
    {caption}
  fps_info: |
    Frames are successfully sampled at 1 frame per second.
  change_of_fps: |
    Because at most {max_frames} frames can be sampled, frames are sampled at {fps} frames per second. If you want to take a closer look, consider calling the function with a shorter time range.
  zoom_in: |
    This is the description of the cropped frame:
    {caption}
  check_instruction:
    text: |
      The following output was returned:
      {graph}
  check_final_picture: |
    This is the description of the final picture:
    {caption}
others:
  affix: |
    Make sure to output any reasoning first, followed by a tool call with arguments, or your answer.
caption:
  sample_frame: |
    Describe the given frame(s) in detail, especially the actions and states. Include:
    - a general summary
    - a segmentation of the frames into actions, with a description of each action
    - a step-by-step description of what happens in each frame
    - identification and labels of visible objects
    - timestamps/frame indices
  zoom_in: |
    Describe the given frame in detail, especially the state of the assembly, for example by identifying and labeling visible objects.
  check_final_picture: |
    Describe the given image in detail, for example by identifying and labeling visible objects.
