prompt: |
  Your task is to write a proof solution to the following problem. Your proof will be graded by human judges for accuracy, thoroughness, and clarity. When you write your proof, follow these guidelines:

  - You are creating a proof, not a proof outline. Each step should be carefully explained and documented. If not properly explained, the judge will assume that you cannot explain it, and therefore decrease your grade.
  - You can use general theorems and lemmas, but only if they are well-known. As a rule of thumb: if the result has a name and is famous enough to have a Wikipedia page or something similar to describe it, it is allowed. Any result from papers that would not be taught in high-school or low-level bachelor courses in mathematics should not be used. Any use of such results will immediately give you a zero grade.
  - Do not skip computation steps in your proof. Clearly explain what transformations were done and why they are allowed in each step of a calculation.
  - You should use correct LaTeX notation to write equations and mathematical symbols. You should encompass these equations in appropriate symbols ("\\(" and "\\)" for inline math, "\\[" and "\\]" for block math) to enhance the clarity of your proof. Do not use any unicode characters.
  - Your proof should be self-contained.
  - If you are not sure about a specific step, or do not know how to prove an intermediate result, clearly state this. It is much preferable to indicate your uncertainty rather than making incorrect statements or claims.

  {problem}
problem_regexes:
  - IMOSL2/202.*
  - BMOSL2/202.*
  - USAMO2/202.*
  - IMOSL2/201.*
  - BMOSL2/201.*
  - USAMO2/201.*
  - IMOSL2/200.*
  - USAMO2/200.*
  - IMOSL2/1.*
  - USAMO2/1.*
allow_shuffle: false
overwrite: false

type: "best_of_n"
attempt_key: "attempts_singular"

n_solutions: 8

judges:
  - name: "Random Judge"
    mode: random
  - name: "Discrete Judge"
    mode: discrete
    correct_word: "correct"
    judge: openai/o4-mini--high
    prompt: |
      You are judging the correctness of an LLM-generated proof for a math problem.

      ### Input:

      Your input will consist of the following components:
      - **Problem Statement**: A mathematical problem that the proof is attempting to solve.
      - **Proof Solution**: The proof that you need to evaluate. This proof may contain errors, omissions, or unclear steps. The proof was generated by another language model, which was given the following instructions:
      <model_prompt>
      - You are creating a proof, not a proof outline. Each step should be carefully explained and documented. If not properly explained, the judge will assume that you cannot explain it, and therefore decrease your grade.
      - You can use general theorems and lemmas, but only if they are well-known. As a rule of thumb: if the result has a name and is famous enough to have a Wikipedia page or something similar to describe it, it is allowed. Any result from papers that would not be taught in high-school or low-level bachelor courses in mathematics should not be used. Any use of such results will immediately give you a zero grade.
      - Do not skip computation steps in your proof. Clearly explain what transformations were done and why they are allowed in each step of a calculation.
      - You should use correct LaTeX notation to write equations and mathematical symbols. You should encompass these equations in appropriate symbols ("\\(" and "\\)" for inline math, "\\[" and "\\]" for block math) to enhance the clarity of your proof. Do not use any unicode characters.
      - Your proof should be self-contained.
      - If you are not sure about a specific step, or do not know how to prove an intermediate result, clearly state this. It is much preferable to indicate your uncertainty rather than making incorrect statements or claims.
      </model_prompt>

      ### How the solution should be graded:
      A solution should be considered correct even if it would earn 5+/7 points in a standard grading format. Examples of small penalties worth 1 point are if the solution:
      - Makes a small computational mistake that can be easily fixed
      - Misses an edge case which can be easily proven/disproven
      - Skips over a step that follows without much reasoning or manual work
      Depending on the severity and the context, you may also not penalise a given error. On the other hand, a solution should be marked as incorrect if:
      - It marks a step as trivial, if it is not immediately obvious with little reasoning why this would be the case.
      - It omits algebra-heavy computational steps, regardless of whether or not it has outlined the methodology. Skipping shorter computations should be permitted.
      - Generalizes over a pattern without rigorously describing the pattern, or without proving any relevant properties.
      - It cites a non-existing or unpopular source/Theorem, which cannot be immediately found from searching for it online. Thus, any theorems that can be immediately found and have a Wikipedia article are allowed.

      ### Further Potential Issues:

      Here are some common types of issues to look for:
      - **Overgeneralization**: The generated proof proceeds by proving the problem in one or more specific cases, and then concludes that the result holds in general. However, it does not provide a proof for the general case.
      - **Oversimplification**: The proof marks steps as trivial or obvious without proper justification.
      - **Skipping Computation Steps**: Proofs that skip computation steps or do not explain transformations clearly can lead to misunderstandings.
      - **Citing Non-Standard Works or Theorems**: Some models may cite theorems or results that are not well-known or are not typically taught in high-school or low-level bachelor courses. Such theorems are only allowed if they are well known.
      - **Missing Edge Cases**: The proof may not consider all possible cases or edge cases.

      The model has been specifically told that it should not skip steps or mark them as trivial. Treat any violation of this rule as evidence that the model does not know how to derive the "trivial" step.

      ### Scoring instructions

      If you believe the proof is correct, end your analysis with \\boxed{{correct}}. If you believe the proof is incorrect, end your analysis with \\boxed{{incorrect}}.

      ### Problem Statement:
      {problem}

      ### Model Solution:
      {solution}
  - name: "Continuous Judge"
    mode: continuous
    judge: openai/o4-mini--high
    prompt: |
      You are judging the correctness of an LLM-generated proof for a math problem.

      ### Input:

      Your input will consist of the following components:
      - **Problem Statement**: A mathematical problem that the proof is attempting to solve.
      - **Proof Solution**: The proof that you need to evaluate. This proof may contain errors, omissions, or unclear steps. The proof was generated by another language model, which was given the following instructions:
      <model_prompt>
      - You are creating a proof, not a proof outline. Each step should be carefully explained and documented. If not properly explained, the judge will assume that you cannot explain it, and therefore decrease your grade.
      - You can use general theorems and lemmas, but only if they are well-known. As a rule of thumb: if the result has a name and is famous enough to have a Wikipedia page or something similar to describe it, it is allowed. Any result from papers that would not be taught in high-school or low-level bachelor courses in mathematics should not be used. Any use of such results will immediately give you a zero grade.
      - Do not skip computation steps in your proof. Clearly explain what transformations were done and why they are allowed in each step of a calculation.
      - You should use correct LaTeX notation to write equations and mathematical symbols. You should encompass these equations in appropriate symbols ("\\(" and "\\)" for inline math, "\\[" and "\\]" for block math) to enhance the clarity of your proof. Do not use any unicode characters.
      - Your proof should be self-contained.
      - If you are not sure about a specific step, or do not know how to prove an intermediate result, clearly state this. It is much preferable to indicate your uncertainty rather than making incorrect statements or claims.
      </model_prompt>

      ### How the solution should be graded:
      A solution should be graded out of a total of 7 points. Examples of small penalties worth 1 point are if the solution:
      - Makes a small computational mistake that can be easily fixed
      - Misses an edge case which can be easily proven/disproven
      - Skips over a step that follows without much reasoning or manual work
      Depending on the severity and the context, you may also not penalise a given error. On the other hand, a solution should receive a very poor grade if:
      - It marks a step as trivial, if it is not immediately obvious with little reasoning why this would be the case.
      - It omits algebra-heavy computational steps, regardless of whether or not it has outlined the methodology. Skipping shorter computations should be permitted.
      - Generalizes over a pattern without rigorously describing the pattern, or without proving any relevant properties.
      - It cites a non-existing or unpopular source/Theorem, which cannot be immediately found from searching for it online. Thus, any theorems that can be immediately found and have a Wikipedia article are allowed.

      The model has been specifically told that it should not skip steps or mark them as trivial. Treat any violation of this rule as evidence that the model does not know how to derive the "trivial" step.

      ### Further Potential Issues:

      Here are some common types of issues to look for:
      - **Overgeneralization**: The generated proof proceeds by proving the problem in one or more specific cases, and then concludes that the result holds in general. However, it does not provide a proof for the general case.
      - **Oversimplification**: The proof marks steps as trivial or obvious without proper justification.
      - **Skipping Computation Steps**: Proofs that skip computation steps or do not explain transformations clearly can lead to misunderstandings.
      - **Citing Non-Standard Works or Theorems**: Some models may cite theorems or results that are not well-known or are not typically taught in high-school or low-level bachelor courses. Such theorems are only allowed if they are well known.
      - **Missing Edge Cases**: The proof may not consider all possible cases or edge cases.

      ### Scoring instructions

      Your score should be a number between 0 and 7, where 0 means the proof is completely incorrect, and 7 means the proof is completely correct. Be very critical in your grading. If you find small errors, deduct points accordingly.

      ### Output Format:

      At the end of your analysis, present your grade as a number between 0 and 7 in "$\boxed{{}}$".

      ### Problem Statement:
      {problem}

      ### Model Solution:
      {solution}
  - name: "Ranking Judge"
    mode: dual
    judge: openai/o4-mini--high
    tournament_type: "swiss"
    tournament_config:
      # max_losses: 2
      n_rounds: 7
      easy_optimization: false
      round_robin: false
      correct_length_bias: false
      correct_position_bias: false
      first_win_word: "A"
      second_win_word: "B"
    prompt: |
      You are judging which of the two LLM-generated proofs for a given math problem is better.

      ### Input:

      Your input will consist of the following components:
      - **Problem Statement**: A mathematical problem that the proof is attempting to solve.
      - **Proof Solution A/B**: The proofs that you need to evaluate. These proofs may contain errors, omissions, or unclear steps. The proofs were generated by another language model, which was given the following instructions:
      <model_prompt>
      - You are creating a proof, not a proof outline. Each step should be carefully explained and documented. If not properly explained, the judge will assume that you cannot explain it, and therefore decrease your grade.
      - You can use general theorems and lemmas, but only if they are well-known. As a rule of thumb: if the result has a name and is famous enough to have a Wikipedia page or something similar to describe it, it is allowed. Any result from papers that would not be taught in high-school or low-level bachelor courses in mathematics should not be used. Any use of such results will immediately give you a zero grade.
      - Do not skip computation steps in your proof. Clearly explain what transformations were done and why they are allowed in each step of a calculation.
      - You should use correct LaTeX notation to write equations and mathematical symbols. You should encompass these equations in appropriate symbols ("\\(" and "\\)" for inline math, "\\[" and "\\]" for block math) to enhance the clarity of your proof. Do not use any unicode characters.
      - Your proof should be self-contained.
      - If you are not sure about a specific step, or do not know how to prove an intermediate result, clearly state this. It is much preferable to indicate your uncertainty rather than making incorrect statements or claims.
      </model_prompt>

      ### How the solution should be graded:
      The following examples are small mistakes that should only be slightly penalised:
      - Makes a small computational mistake that can be easily fixed
      - Misses an edge case which can be easily proven/disproven
      - Skips over a step that follows without much reasoning or manual work
      On the other hand, a solution should be severely penalised if:
      - It marks a step as trivial, if it is not immediately obvious with little reasoning why this would be the case.
      - It omits algebra-heavy computational steps, regardless of whether or not it has outlined the methodology. Skipping shorter computations should be permitted.
      - Generalizes over a pattern without rigorously describing the pattern, or without proving any relevant properties.
      - It cites a non-existing or unpopular source/Theorem, which cannot be immediately found from searching for it online. Thus, any theorems that can be immediately found and have a Wikipedia article are allowed.

      The model has been specifically told that it should not skip steps or mark them as trivial. Treat any violation of this rule as evidence that the model does not know how to derive the "trivial" step.

      ### Further Potential Issues:

      Here are some common types of issues to look for:
      - **Overgeneralization**: The generated proof proceeds by proving the problem in one or more specific cases, and then concludes that the result holds in general. However, it does not provide a proof for the general case.
      - **Oversimplification**: The proof marks steps as trivial or obvious without proper justification.
      - **Skipping Computation Steps**: Proofs that skip computation steps or do not explain transformations clearly can lead to misunderstandings.
      - **Citing Non-Standard Works or Theorems**: Some models may cite theorems or results that are not well-known or are not typically taught in high-school or low-level bachelor courses. Such theorems are only allowed if they are well known.
      - **Missing Edge Cases**: The proof may not consider all possible cases or edge cases.

      ### Scoring instructions

      You should compare the two proofs and determine which one is better. If you believe Proof A is better, end your analysis with \\boxed{{A}}. If you believe Proof B is better, end your analysis with \\boxed{{B}}. If you believe both proofs are equally good, end your analysis with \\boxed{{equal}}.

      ### Problem Statement:
      {problem}

      ### Proof Solution A:
      {solution_a}

      ### Proof Solution B:
      {solution_b}
