
default_max_tokens: 64000
default_temperature: 0.6
name: "completeness_checker_diversity"
task: "judge"
prompt: |
  You are judging the completeness of an LLM-generated proof for a math problem.

  ### Input:

  Your input will consist of the following components:
  - **Problem Statement**: A mathematical problem that the proof is attempting to solve.
  - **Proof Solution**: The proof that you need to evaluate. This proof may contain errors, omissions, or unclear steps. The proof was generated by another language model, which was given the following instructions:
  <model_prompt>
  - You are creating a proof, not a proof outline. Each step should be carefully explained and documented. If not properly explained, the judge will assume that you cannot explain it, and therefore decrease your grade.
  - You can use general theorems and lemmas, but only if they are well-known. As a rule of thumb: if the result has a name and is famous enough to have a Wikipedia page or something similar to describe it, it is allowed. Any result from papers that would not be taught in high-school or low-level bachelor courses in mathematics should not be used. Any use of such results will immediately give you a zero grade.
  - Do not skip computation steps in your proof. Clearly explain what transformations were done and why they are allowed in each step of a calculation.
  - You should use correct LaTeX notation to write equations and mathematical symbols. You should encompass these equations in appropriate symbols ("\\(" and "\\)" for inline math, "\\[" and "\\]" for block math) to enhance the clarity of your proof. Do not use any unicode characters.
  - Your proof should be self-contained.
  - If you are not sure about a specific step, or do not know how to prove an intermediate result, clearly state this. It is much preferable to indicate your uncertainty rather than making incorrect statements or claims.
  </model_prompt>

  ### How the solution should be graded:
  A solution should be considered complete if it describes every step of the proof with clear logical reasoning. Do NOT verify the overall correctness of the proof; only check whether sufficient detail is presented to make the proof complete.

  #### Small mistakes that do not affect the overall completeness of the proof:
  - Makes a small computational mistake that can be easily fixed
  - Misses an edge case which can be easily proven/disproven
  - Skips over a step that follows without much reasoning or manual work

  #### The following mistakes should be considered as making the proof incomplete:
  - It marks a step as trivial when it is not immediately obvious, with little reasoning, why the step holds.
  - It omits algebra-heavy computational steps, regardless of whether or not it has outlined the methodology. Skipping shorter computations should be permitted.
  - It generalizes over a pattern without rigorously describing the pattern, or without proving any relevant properties.
  - It cites a nonexistent or obscure source or theorem that cannot be immediately found by searching for it online. Any theorem that can be immediately found and has a Wikipedia article is allowed.

  The model has been specifically told that it should not skip steps or mark them as trivial. Treat any violation of this rule as evidence that the model does not know how to derive the "trivial" step.

  ### Scoring instructions

  If you believe the proof is complete, end your analysis with \\boxed{{complete}}. If you believe the proof is incomplete, end your analysis with \\boxed{{incomplete}}.

  ### Problem Statement:
  {problem}

  ### Model Solution:
  {solution}
data_path: "data/postprocess/matharena_proofs/diversity_samples.json"
solver_models:
  - 'deepseek/deepseek_v32_think'
  - 'gemini/gemini-3-flash'
  - 'gemini/gemini-31-pro'
  - 'openai/gpt-54'
  - 'stepfun/3.5-flash'
  - 'glm/glm-5'
  - 'xai/grok-41-fast-reasoning'
  - 'moonshot/k25'
  - 'qwen/qwen35_397b_a17b_high'
  - 'openai/oss-120b'
n_solutions: -1
