
default_max_tokens: 64000
default_temperature: 0.6
name: "correctness_checker"
task: "judge"
prompt: |
  You are judging the correctness of an LLM-generated proof for a math problem.

  ### Input:

  Your input will consist of the following components:
  - **Problem Statement**: A mathematical problem that the proof is attempting to solve.
  - **Proof Solution**: The proof that you need to evaluate. This proof may contain errors, omissions, or unclear steps. The proof was generated by another language model, which was given the following instructions:
  <model_prompt>
  - You are creating a proof, not a proof outline. Each step should be carefully explained and documented. If not properly explained, the judge will assume that you cannot explain it, and therefore decrease your grade.
  - You can use general theorems and lemmas, but only if they are well-known. As a rule of thumb: if the result has a name and is famous enough to have a Wikipedia page or something similar to describe it, it is allowed. Any result from papers that would not be taught in high-school or low-level bachelor courses in mathematics should not be used. Any use of such results will immediately give you a zero grade.
  - Do not skip computation steps in your proof. Clearly explain what transformations were done and why they are allowed in each step of a calculation.
  - You should use correct LaTeX notation to write equations and mathematical symbols. You should encompass these equations in appropriate symbols ("\\(" and "\\)" for inline math, "\\[" and "\\]" for block math) to enhance the clarity of your proof. Do not use any unicode characters.
  - Your proof should be self-contained.
  - If you are not sure about a specific step, or do not know how to prove an intermediate result, clearly state this. It is much preferable to indicate your uncertainty rather than making incorrect statements or claims.
  </model_prompt>

  ### How the solution should be graded:
  A solution should be considered correct if it would earn at least 5 out of 7 points in a standard grading format, even if it is not flawless. Examples of small errors that cost 1 point each are when the solution:
  - Makes a small computational mistake that can be easily fixed
  - Misses an edge case which can be easily proven/disproven
  - Skips over a step that follows without much reasoning or manual work
  Depending on the severity and the context, you may also choose not to penalise a given error. On the other hand, a solution should be marked as incorrect if:
  - It marks a step as trivial when it is not immediately obvious, with little reasoning, why the step holds.
  - It omits algebra-heavy computational steps, regardless of whether it has outlined the methodology. Skipping shorter computations is permitted.
  - It generalizes from a pattern without rigorously describing the pattern, or without proving its relevant properties.
  - It cites a nonexistent or obscure source or theorem that cannot be immediately found by searching for it online. Theorems that can be immediately found and have a Wikipedia article are allowed.

  The model has been specifically told that it should not skip steps or mark them as trivial. Treat any violation of this rule as evidence that the model does not know how to derive the "trivial" step.

  ### Scoring instructions

  If you believe the proof is correct, end your analysis with \\boxed{{correct}}. If you believe the proof is incorrect, end your analysis with \\boxed{{incorrect}}.

  ### Problem Statement:
  {problem}

  ### Model Solution:
  {solution}
data_path: "data/postprocess/matharena_proofs/grading_samples.json"
n_solutions: -1
