# Centralized Prompt Configuration for Model Evaluation System
# This file contains all prompts used during the evaluation process

# System prompt for the mathematical problem-solving assistant
system_prompt: |
  # Background
  The IMProofBench project is a mathematical reasoning benchmark for AI systems, testing their ability to solve research-level math problems. Each such problem consists of one **main question**, where the expected answer is a long-form mathematical proof, and several related **subquestions** which have short, unique answers (e.g. a natural number). The main answer will be graded by both human expert mathematicians (often the author of the question) and AI evaluators, whereas subquestion answers are checked automatically using a Python script.
  
  # Structure of the evaluation
  In the following, we would like to evaluate your mathematical reasoning abilities on one such problem. The conversation below iterates through the questions in order (main question, subquestion 1, subquestion 2, ...), and in each step you can:
  
  - Read the current question
  - Think about it in a multi-turn environment with tool use (see below)
  - Submit the answer to the current question
  
  At each point in the conversation, you have the context of the entire previous conversation including your outputs in the thinking steps and the record of any tool uses. Note that you will *not necessarily* have access to records of your internal reasoning traces and internal tool uses, so any helpful information from these should be documented in your (external) thinking outputs. 
  
  # Multi-turn reasoning environment
  To help you solve the problem, you will have access to a multi-turn conversation environment with optional tool use, based on the Inspect AI framework. At each step, you can:
  
  - Think out loud to analyze the problem, devise a solution approach, think through the steps of mathematical arguments, etc.
  - Use the `python` tool to run self-contained experiments in a standard Python environment
  - Use the `bash` tool to execute commands inside a Docker container (running Arch Linux with some open-source mathematical software installed)
  - Use the `web_search` tool to search for current information, mathematical definitions, theorems, or recent research
  - Use the `sage_computation` tool for conducting an experiment in a self-contained SageMath terminal session
  - Use the `submit` tool to provide your final answer to the current question (main question or subquestion)
  
  All tools have a timeout of 15 minutes and a maximum memory usage (RAM) of 8 GB, and run on standard 2025 hardware.
  
  # Token constraints
  You have {main_question_token_limit:,} tokens to solve the main question, and {subquestion_token_limit:,} tokens for each of the following subquestions. This counts both your output tokens (including those in tool calls) and your reasoning tokens. You will be informed of your current usage after each conversation turn.
  
  # Answer format for main question
  Below you will see the text of the main question. Once you have finished reasoning about it, you can register your answer using the `submit` tool. The answer for the main question should be a detailed mathematical argument, formatted in Markdown with LaTeX formulas using $...$ for inline mathematical expressions and $$...$$ for equations. Use Markdown [link formatting](https://www.markdownguide.org/basic-syntax/#links) for including online references, *not* any internal web-referencing system.
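
# Rendering note (illustrative sketch, not taken from the evaluator code):
# the `:,` inside the token placeholders is Python format-spec syntax for a
# thousands separator, so this assumes the templates are filled with
# Python's str.format(). The limit values below are hypothetical.
#
#   {main_question_token_limit:,} with main_question_token_limit = 300000  ->  300,000
#   {subquestion_token_limit:,}   with subquestion_token_limit = 100000    ->  100,000
#
# Under the same assumption, any literal brace added to a formatted template
# would need to be doubled ({{ and }}) to survive str.format().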

# Submit tool response messages
submit_responses:
  # Response when answer is submitted with more stages to follow
  stage_success: |
    ✅ **Answer submitted successfully for {stage_name}**
    
    {answer_delimiter_start}
    {answer}
    {answer_delimiter_end}
    
    ---
    
    **Next Stage:**
    
    {next_stage_content}
    
    **Instructions:** Solve this question step by step and use submit() when you have your final answer.

  # Response when final answer is submitted
  final_success: |
    ✅ **Final answer submitted successfully**
    
    {answer_delimiter_start}
    {answer}
    {answer_delimiter_end}
    
    🎉 **Evaluation Complete!** 
    
    You have successfully completed all parts of this mathematical problem. All your answers have been recorded for evaluation.

  # Error response when submission fails
  submission_error: |
    ❌ **Submission Error**
    
    Failed to submit answer: {error}
    
    Please ensure your answer is in a valid format and try again.
    Common formats: numbers (42), strings ("hello"), lists ([1, 2, 3]), etc.

# Subquestion transition prompts
subquestion_transition:
  # Prompt template for transitioning to a new subquestion
  template: |
    ---
    
    **Great work on the previous part!** 
    
    You have successfully completed the previous question. Now please solve the following subquestion while keeping the context of your previous work:
    
    **Subquestion {subquestion_order}:**
    {subquestion_text}
    
    **Instructions:**
    - You can reference your work from previous parts
    - Use the same mathematical tools available to you  
    - When you have your final answer, use the submit() tool to submit it
    - Be precise and specific in your answer format
    
    Please proceed with solving this subquestion.

# Stage names for display (used in submit tool responses)
stage_display_names:
  main_question: "Main Question"
  subquestion_a: "Subquestion A"
  subquestion_b: "Subquestion B"
  subquestion_c: "Subquestion C"
  subquestion_d: "Subquestion D"
  subquestion_e: "Subquestion E"

# Continue prompt when model doesn't make tool calls
continue_prompt: |
  Please continue working on the current question. To formally register your answer, use the `submit` tool as per the original instructions above. Note: the conversation will only proceed to the next stage once you use the `submit` tool.

# Continue prompt with token usage information
continue_prompt_with_tokens: |
  Please continue working on the current question. To formally register your answer, use the `submit` tool as per the original instructions above. Note: the conversation will only proceed to the next stage once you use the `submit` tool.

  Token usage: {current_tokens:,} of {token_limit:,} tokens used for this stage.

# Tool emulation reminders (for models without native tool calling support)
tool_emulation_reminders:
  # Reminder about submit tool syntax for tool emulation models
  submit_syntax: |

    Reminder: When you're ready to submit your final answer, use the submit tool with this exact format:
    <tool_call>{"name": "submit", "arguments": {"answer": "your answer here"}}</tool_call>

    Make sure to include the closing </tool_call> tag.
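
# Maintainer note (a sketch, assuming the emulation layer parses the body of
# <tool_call> as standard JSON): backslashes inside a JSON string must be
# doubled, so a LaTeX answer would be written as, for example:
#
#   <tool_call>{"name": "submit", "arguments": {"answer": "$\\frac{1}{2}$"}}</tool_call>
#
# With a single backslash, "\sqrt" is rejected by a JSON parser (invalid
# escape), while "\frac" parses but is silently read as a form feed followed
# by "rac". The malformed_submit_confirmations prompts below cover cases like
# the first one.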

# Evaluation transition prompts (used in evaluator.py)
evaluation_transitions:
  # Prompt for transitioning from main question to first subquestion  
  main_to_subquestion: |
    Great! Your answer for the main question has been recorded.

    Now, please solve the following subquestion:

    **Subquestion {subquestion_order}:**

    {subquestion_text}

    **Answer Format:** Provide your answer as a simple value. For example: `42`, `[1,2,3]`, `x^2+1`, `True`, etc.
  
  # Prompt for transitioning between subquestions
  subquestion_to_subquestion: |
    Excellent! Your answer for subquestion {current_subquestion} has been recorded.

    Now, please solve the next subquestion:

    **Subquestion {next_subquestion}:**

    {subquestion_text}

    **Answer Format:** Provide your answer as a simple value. For example: `42`, `[1,2,3]`, `x^2+1`, `True`, etc.

# Final round prompt when token limit is exceeded
final_round_prompt: |
  ⚠️ TOKEN LIMIT REACHED - FINAL SUBMISSION REQUIRED

  You have exhausted your token allocation for this evaluation.
  You must submit your answer NOW.

  Use the `submit` tool with your final answer - it will be recorded as your submission.
  If you do not use the `submit` tool, your text response will be taken as your final answer.

  Submit your best answer based on your work so far.

# Malformed submit confirmation prompts
malformed_submit_confirmations:
  # XML with malformed JSON (e.g., r'''...''' syntax)
  xml_malformed_json: |
    The system has detected a malformed submit tool call (invalid JSON inside XML tags). The following string was extracted:

    {answer_preview}

    The above string will be recorded as your submission unless your answer to this prompt contains the string ABORT_SUBMISSION. If you wish to proceed with the submission, please reply with a short acknowledgement.

  # Raw JSON without XML tags
  raw_json_no_xml: |
    The system has detected a malformed submit tool call (missing XML tags). The following string was extracted:

    {answer_preview}

    The above string will be recorded as your submission unless your answer to this prompt contains the string ABORT_SUBMISSION. If you wish to proceed with the submission, please reply with a short acknowledgement.

  # Incomplete XML (missing closing tag)
  incomplete_xml: |
    The system has detected an incomplete submit tool call (missing closing tag). The following string was extracted:

    {answer_preview}

    The above string will be recorded as your submission unless your answer to this prompt contains the string ABORT_SUBMISSION. If you wish to proceed with the submission, please reply with a short acknowledgement.

  # Recognized tool call with malformed JSON
  tool_call_malformed_json: |
    The system has detected a malformed call to the submit tool. The following string was extracted from this call:

    {answer_preview}

    The above string will be recorded as your submission unless your answer to this prompt contains the string ABORT_SUBMISSION. If you wish to proceed with the submission, please reply with a short acknowledgement.

# Configuration for answer delimiters
answer_delimiters:
  # Prefix and suffix for answer delimiters (hash will be appended)
  prefix: "<<<ANSWER_START_"
  suffix: ">>>"
  end_prefix: "<<<ANSWER_END_"
  end_suffix: ">>>"
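  # Illustrative example (the hash value `3f9a1c` is hypothetical; how the hash
  # is derived is not specified in this file). With the settings above, the
  # {answer_delimiter_start} and {answer_delimiter_end} placeholders used in
  # submit_responses would presumably expand to:
  #
  #   <<<ANSWER_START_3f9a1c>>>
  #   <<<ANSWER_END_3f9a1c>>>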

# Tool descriptions for AI models
tool_descriptions:
  python:
    name: "python"
    description: |
      Use the python function to execute Python code.

      The Python tool executes single-run Python scripts. Important notes:
      1. Each execution is independent - no state is preserved between runs
      2. You must explicitly use print() statements to see any output
      3. Simply writing expressions (like in notebooks) will not display results
      4. The script cannot accept interactive input during execution
      5. Return statements alone won't produce visible output
      6. All variables and imports are cleared between executions
      7. Standard output (via print()) is the only way to see results
      8. This tool has a timeout of 15 minutes and a maximum memory usage (RAM) of 8 GB
    parameters:
      code:
        type: "string"
        description: "The python code to execute."
        
  bash:
    name: "bash"
    description: |
      Use this function to execute bash commands. The underlying system is Arch Linux with many standard open-source computer algebra systems (like GAP) pre-installed.
      This tool has a timeout of 15 minutes and a maximum memory usage (RAM) of 8 GB.
    parameters:
      cmd:
        type: "string"
        description: "The bash command to execute."
        
  web_search:
    name: "web_search"
    description: |
      Use this function to search the web for current information, mathematical definitions, theorems, or recent research.
      
      This tool gives you access to up-to-date information that can help with:
      - Looking up mathematical definitions and theorems
      - Finding recent research papers or results
      - Verifying computational results against known databases
      - Checking current mathematical conventions or notation
      - Finding examples of similar problems or techniques
      
      The search results will include titles, URLs, and relevant excerpts from web pages.
      Use this tool when you need information that might not be in your training data or when you want to verify facts.
    parameters:
      query:
        type: "string"
        description: "The search query to look up information on the web."
        
  sage_computation:
    name: "sage_computation"
    description: |
      Use the sage_computation function to run calculations in the open-source mathematics
      software system SageMath.

      The sage_computation tool executes single-run SageMath scripts. Important notes:
      1. Each execution is independent - no state is preserved between runs
      2. You must explicitly use print() statements to see any output
      3. Simply writing expressions (like in notebooks) will not display results
      4. The script cannot accept interactive input during execution
      5. Return statements alone won't produce visible output
      6. All variables and imports are cleared between executions
      7. Standard output (via print()) is the only way to see results
      8. This tool has a timeout of 15 minutes and a maximum memory usage (RAM) of 8 GB
      
      All standard SageMath functions are pre-imported and available.
      The SageMath preparser is applied, so you can use natural mathematical syntax.
      
      Key Features:
      - Natural syntax: Use x^2 for powers, K.<a> for field extensions
      - All mathematical objects pre-imported: Matrix, EllipticCurve, PolynomialRing, etc.
      - Advanced packages available: admcycles for moduli spaces, and many more
      
      Examples:
          # Factor a polynomial
          print(factor(x^100 - 1))
          
          # Define a number field
          K.<a> = NumberField(x^3 - 2)
          
          # Work with elliptic curves
          E = EllipticCurve([0, 1])
          print(E.rank())
          
          # Use specialized packages (example with admcycles)
          from admcycles import *
          G = StableGraph([1,1],[[1,3],[2,4]],[(1,2),(3,4)])
          print(f"Automorphisms^2: {G.automorphism_number()^2}")
      
      IMPORTANT: Like the python() tool, you must use print() to see any output.
      Nothing is returned automatically - always print your results!
    parameters:
      sage_code:
        type: "string"
        description: "SageMath code using natural mathematical syntax (x^2, K.<a>, etc.)"
        
  submit:
    name: "submit"
    description: |
      Submit your final answer for the current question or subquestion. Use Markdown + LaTeX formatting.
      The answer for the main question should be a detailed mathematical argument.
      
      Your answer should be formatted as natural Markdown text with LaTeX formulas.
      Use $ for inline math and $$ for display math, or \begin{equation} environments.
      Use standard [Markdown link syntax](https://www.markdownguide.org/basic-syntax/#links) for online references.
      
      RECOMMENDED: Use raw strings (r'''...''' or r"...") to write LaTeX naturally without escaping.
      
      Important formatting notes:
      - Write your answer exactly as you would in a math document
      - Use raw triple quotes r''' for multiline answers with LaTeX
      - This lets you write \frac, \sqrt, \int naturally (no escaping needed)
      - Include full mathematical reasoning with the final answer clearly stated
      - Do not use custom macros (e.g., \Z, \Q, \RR, etc.). Only use valid standard LaTeX commands
    parameters:
      answer:
        type: "string"
        description: "Your final answer as Markdown text with LaTeX formulas"
