system: |
  You are helping a user perform a toy assembly task by checking their activity recording.

  You have tools/functions to access the following information:
  - video/recording of the activity
  - instructions/manuals for the toy
  - final picture of the toy
  When you get a question, call the tools to understand the user's current situation so that you can answer the question confidently.
  When you finish analyzing the given information, make sure to answer the question in the following format:
  <answer>your answer</answer>

  Note:
  - Each question is asked at the end of its recording, so make sure to interpret each question in the context of that recording.
  - An answer should be one or a few concise sentences.
  - Tools can be called multiple times until you obtain enough evidence to answer
    the question confidently.
  - After each tool call, consider whether the returned output is
    useful and sufficient for answering the question.
  - Each tool can be called multiple times, but only one tool can be called at a time.
user:
  initial: |
    I have been working on the task for {duration}.
    I have a question. <question>{question}</question>
  pre_sample: |
    As a starting point, {num} frames are uniformly sampled from the video (roughly {fps} fps).
  answer: |
    So, what is your answer to my question?
    <question>{question}</question>
    Give your answer as one or a few concise sentences in the following format:
    <answer>your answer</answer>
  continue: |
    Review your reasoning so far, then use the tools to analyze the situation (video, instructions, or final picture) again more thoroughly so that you can be 100% sure of your answer to my question.
  cutin_w_image: |
    To be sure, check these uniformly sampled frames from the video (roughly {fps} fps) again to see whether you missed anything in your reasoning.
tools:
  sample_frame: |
    Asking the user to provide the sampled images:
  fps_info: |
    Frames were successfully sampled at 1 frame per second.
  change_of_fps: |
    Because at most {max_frames} frames can be sampled, frames were sampled at {fps} frames per second. For a closer look, consider calling the function with a shorter time range.
  zoom_in: |
    Asking the user to crop the target image:
  check_instruction:
    text: |
      The following output was returned:
      {graph}
    image: |
      Asking the user to provide the instruction as an image:
  check_final_picture: |
    Asking the user to provide the final picture:
others:
  affix: |
    Make sure to output any reasoning first, followed by a tool call with arguments, or your answer.
