version: "v1"
created: "2025-01-30T00:00:00"
description: "Adaptive VLM reasoning prompts matching legacy implementation exactly"
scaffold_type: "adaptive"
git_hash: null
experiment_id: null
author: "<ANONYMIZED>"

# VLM prompts for image captioning
vlm_initial_description_prompt: |
  I need your help analyzing this image to prepare for answering the following question:

  {question}

  IMPORTANT: DO NOT answer the question directly. Instead, provide a comprehensive and detailed description of everything visible in the image that could be relevant for answering this question.

  Focus on describing:
  - All objects, people, text, and visual elements in the image
  - Spatial relationships between different elements
  - Any text content that is visible, transcribed exactly
  - Colors, shapes, patterns, and visual attributes
  - Relevant contextual details and background information

  Your description should be detailed enough that someone could mentally reconstruct the image without seeing it, but DO NOT provide step-by-step instructions on how to recreate it.

vlm_verification_prompt: |
  I will provide an image, an image description, and the question this description is intended to help answer.

  Original Question: "{question}"

  Image Description: {description}

  Your task is to critically evaluate the **Image Description** ONLY for its accuracy and completeness in representing the visual elements of the image that are relevant to the **Original Question**.

  **DO NOT ANSWER THE ORIGINAL QUESTION.** Your sole focus is the quality of the image description itself.

  Please perform the following steps:
  1. Compare the **Image Description** to the image.
  2. Identify any inaccuracies or missing visual details in the description that would be important for answering the **Original Question**.
  3. If the **Image Description** is perfect and needs no changes, respond ONLY with the exact phrase: "The description is accurate and complete."
  4. If the **Image Description** needs improvement, provide a revised and more accurate/complete description. Start your response ONLY with "Improved description:" followed by the full, corrected description. Do not include any other text or explanations, and specifically, do not include an answer to the **Original Question** in your response.

  Remember, your output should be EITHER the exact phrase "The description is accurate and complete." OR "Improved description: [full corrected description here]". Nothing else.
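# Illustrative only: the two response shapes that vlm_verification_prompt permits.
# The improved-description text is a made-up example, not real model output.
#   The description is accurate and complete.
#   Improved description: A whiteboard shows the handwritten equation "x + 5 = 10".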

vlm_focused_description_prompt: |
  I will provide an image, the previous descriptions of that image, and the question those descriptions are intended to help answer.

  Original Question: "{question}"

  Previous Descriptions: {previous_descriptions}

  Your specific task: {focus_request}

  CRITICAL INSTRUCTIONS:
  - You are a VISUAL DESCRIBER only - DO NOT attempt to answer the original question
  - DO NOT solve the problem or provide calculations
  - DO NOT give step-by-step solutions or reasoning
  - ONLY describe what you can see in the image that relates to the specific request
  - Focus solely on visual elements: objects, text, numbers, shapes, spatial relationships, etc.
  - If asked about measurements, describe what you see but don't calculate or solve
  - If asked about equations, transcribe what's visible but don't solve them

  Your response should be pure visual description addressing the specific request above, nothing more.

# Reasoner prompts for adaptive reasoning
reasoner_adaptive_prompt: |
  You are an expert visual reasoning assistant. Your task is to be SKEPTICAL and THOROUGH.

  Image Description: {cumulative_description}

  Question: {question}

  Previous attempts and requests: {interaction_history}

  {urgency_note}

  SKEPTICAL ANALYSIS INSTRUCTIONS:
  1. **ASSUME the image description is INCOMPLETE** - it likely misses crucial details needed for the question.

  2. **IDENTIFY SPECIFIC GAPS**: What exact visual information do you need that's missing or vague?
     - For math problems: exact numbers, measurements, equations, graph values
     - For text problems: precise wording, specific labels
     - For spatial problems: exact positions, orientations, relationships

  3. **AVOID REPETITION**: Look at your previous requests above. DO NOT ask similar or identical questions again.
     The VLM runs at a low temperature, so repeated questions will yield nearly identical answers.
     Ask NEW, DIFFERENT questions that explore DIFFERENT aspects of the image.

  4. **DECISION CRITERIA**:
     - If you need ANY specific visual detail that is not clearly provided AND you haven't asked about it before: Status should be NEED_MORE_INFO.
     - Ask for EXACTLY what you need in the Request field - something you haven't asked about before.
     - Only if you have ALL the specific details needed: Status should be SOLVED.

  5. **AVOID ASSUMPTIONS**: Don't guess numbers, don't assume "typical" values, don't fill in missing details.

  CRITICAL PRINCIPLES:
  - **ALWAYS ASK rather than GUESS**: Even if 99% certain, ask a NEW question rather than make assumptions
  - **NEVER repeat questions**: Each request must explore a different aspect/area of the image
  - **Better to use all iterations asking diverse questions than to guess incorrectly once**
  - **NEVER answer "unable to determine"**: When you do give an answer, always provide your best answer based on the available information

  REMEMBER: Each request should explore a NEW area/aspect of the image. Don't waste iterations on repeated questions!

  OUTPUT FORMAT:
  You MUST provide all of the following fields in your response, each on a new line:
  Reasoning: [Your detailed step-by-step thinking process, including why this request is different from previous ones]
  Status: [SOLVED or NEED_MORE_INFO]
  Answer: [Your final answer if Status is SOLVED, otherwise N/A]
  Request: [Your specific NEW request for the VLM if Status is NEED_MORE_INFO, otherwise N/A]

  EXAMPLES OF GOOD REQUESTS:
  - "What is the exact number written on the sign in the top left corner?"
  - "Please read all the text visible in this mathematical equation."
  - "What are the precise measurements shown on the ruler or scale?"
  - "Describe the specific symbols or notation used in this mathematical expression."
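# Illustrative sketch only (hypothetical, not real model output): two responses that
# follow the four-field OUTPUT FORMAT of reasoner_adaptive_prompt, assuming the
# whiteboard question from example_usage ("What is the value of x in this equation?").
#
# Case 1, the description already states the full equation "x + 5 = 10":
#   Reasoning: The description gives the complete equation x + 5 = 10, so x = 10 - 5 = 5.
#   Status: SOLVED
#   Answer: 5
#   Request: N/A
#
# Case 2, the description only says "an equation is written on a whiteboard":
#   Reasoning: The equation itself is not transcribed, and I have not asked for it before.
#   Status: NEED_MORE_INFO
#   Answer: N/A
#   Request: Please read all the text visible in this mathematical equation.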

# Reasoner continuation prompt for handling abrupt cutoffs
reasoner_continuation_prompt: |
  Your previous reasoning was cut off. Please continue from where you left off. There is a chance that you simply forgot to explicitly state the Status, Answer, and Request.

  Original Question: {question}

  Available Visual Information:
  {cumulative_description}

  Previous Interaction History:
  {interaction_history}

  Your Incomplete Reasoning (that was cut off):
  {incomplete_reasoning}

  Please continue your reasoning from where it was cut off and provide a complete response in the required format:

  Reasoning: [Continue your reasoning process from where it was interrupted]
  Status: [SOLVED or NEED_MORE_INFO]
  Answer: [Your final answer if Status is SOLVED, otherwise N/A]
  Request: [Your specific request for the VLM if Status is NEED_MORE_INFO, otherwise N/A]

reasoner_synthesis_prompt: |
  You have reached the maximum number of reasoning iterations ({max_iterations}).

  Based on all the information gathered, please provide your best final answer to this question:

  Question: {question}

  All Image Descriptions:
  {cumulative_description}

  All Reasoning Attempts:
  {interaction_history}

  IMPORTANT: You MUST provide a final answer now. NEVER answer "unable to determine" - always provide your best guess based on all the available information, even if you're not completely certain.

  FORMAT YOUR RESPONSE AS:
  Reasoning: [Your final reasoning process]
  Answer: [Your final answer - must be a specific answer, not "unable to determine"]

# -----------------------------------------------------------------------------
# 💡  Answer extraction prompt (legacy post-processing step)
# -----------------------------------------------------------------------------
answer_extraction_prompt: |
  You are tasked with extracting the final answer from a complete reasoning trajectory.

  ORIGINAL QUESTION:
  {question}

  ALL IMAGE DESCRIPTIONS GATHERED:
  {cumulative_description}

  COMPLETE REASONING TRAJECTORY:
  {interaction_history}

  FINAL REASONING OUTPUT:
  {final_reasoning}

  CURRENT ANSWER CANDIDATE:
  {current_answer}

  Your task is to:
  1. Review the entire reasoning trajectory above
  2. Extract the most accurate and well-supported final answer
  3. Ensure the answer is clear, concise, and properly formatted

  IMPORTANT GUIDELINES:
  - If the trajectory contains a clear, well-reasoned answer, extract it exactly.
  - If there are multiple valid answers, list ALL valid answers with a brief explanation.
  - NEVER respond with "unable to determine" or "N/A" or "I don't know" or "I cannot determine" – always provide the best possible answer.

  SPECIAL HANDLING FOR MULTIPLE CHOICE QUESTIONS:
  - If the question contains "Choices:" or "(A)", "(B)", "(C)", "(D)", this is a multiple choice question
  - For multiple choice questions, your answer should ONLY be the letter (A, B, C, D, etc.)
  - Examples of good multiple choice answers: "A", "B", "C", "D"
  - If the reasoning says something like "no" and the choices include "(B) no", extract "B"
  - If the reasoning says "watched a movie" and the choices include "(B) watched a movie", extract "B"
  - If the reasoning already provides a letter like "D", extract "D"
  - Always map text answers to their corresponding choice letters

  EXTRACT THE FINAL ANSWER:
  Final Answer:
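# Illustrative sketch of the multiple-choice mapping described in answer_extraction_prompt.
# The inputs are hypothetical; the extracted value follows from the rules in the prompt.
#   question contains:    "Choices: (A) yes (B) no"
#   reasoning concludes:  "no"
#   extracted answer:     "B"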

# Template variables referenced by the prompts above

template_variables:
  question: "The question to be answered about the image"
  description: "The image description from the VLM"
  cumulative_description: "All image descriptions gathered so far"
  previous_descriptions: "Previous VLM descriptions"
  focus_request: "Specific request for focused description"
  interaction_history: "History of reasoning attempts and requests"
  urgency_note: "Note about current iteration status"
  incomplete_reasoning: "Incomplete reasoning from cut-off response"
  max_iterations: "Maximum number of iterations allowed"
  final_reasoning: "Final reasoning output passed to answer extraction"
  current_answer: "Current answer candidate before answer extraction"

example_usage:
  question: "What is the value of x in this equation?"
  description: "The image shows a mathematical equation with x + 5 = 10 written on a whiteboard"
  expected_output: "Iterative reasoning with focused VLM requests until solved"