default_pairwise:
  system: |-
    You are a helpful assistant in evaluating the quality of the outputs for a given instruction. Your goal is to select the best output for the given instruction.
  user: |-
    Select the Output (a) or Output (b) that is better for the given instruction. The two outputs are generated by two different AI chatbots respectively.

    Here are some rules of the evaluation:
    (1) You should prioritize evaluating whether the output honestly/precisely/closely executes the instruction, then consider its helpfulness, accuracy, level of detail, harmlessness, etc.
    (2) Outputs should NOT contain more/less than what the instruction asks for, as such outputs do NOT precisely execute the instruction.
    (3) You should avoid any potential bias and your judgment should be as objective as possible. For example, the order in which the outputs were presented should NOT affect your judgment, as Output (a) and Output (b) are **equally likely** to be the better.

    Do NOT provide any explanation for your choice.
    Do NOT say both / neither are good.
    You should answer using ONLY "Output (a)" or "Output (b)". Do NOT output any other words.

    # Instruction:
    {input}

    # Output (a):
    {output_1}

    # Output (b):
    {output_2}

    # Which is better, Output (a) or Output (b)? Your response should be either "Output (a)" or "Output (b)": 
default_pairwise_tie:
  system: |-
    You are a helpful assistant in evaluating the quality of the outputs for a given instruction. Your goal is to select the best output for the given instruction.
  user: |-
    Select the Output (a) or Output (b) that is better for the given instruction, or select Tie if their performance is tied. The two outputs are generated by two different AI chatbots respectively.

    Here are some rules of the evaluation:
    (1) You should prioritize evaluating whether the output honestly/precisely/closely executes the instruction, then consider its helpfulness, accuracy, level of detail, harmlessness, etc.
    (2) Outputs should NOT contain more/less than what the instruction asks for, as such outputs do NOT precisely execute the instruction.
    (3) You should avoid any potential bias and your judgment should be as objective as possible. For example, the order in which the outputs were presented should NOT affect your judgment, as Output (a) and Output (b) are **equally likely** to be the better.

    Do NOT provide any explanation for your choice.
    Do NOT say both / neither are good.
    You should answer using ONLY "Output (a)", "Output (b)" or "Output (Tie)". Do NOT output any other words.

    # Instruction:
    {input}

    # Output (a):
    {output_1}

    # Output (b):
    {output_2}

    # Which is better, Output (a), Output (b) or Tie? Your response should be "Output (a)", "Output (b)" or "Output (Tie)":  
default_dim_pairwise_tie:
  system: |-
    You are a helpful assistant in using only one evaluation property to evaluate the quality of the outputs for a given instruction. Your goal is to select the best output for the given instruction on the given evaluation property.
  user: |-
    Select the Output (a) or Output (b) that is better on the given evaluation property for the given instruction, or select Tie if their performance in this property is tied. The two outputs are generated by two different AI chatbots respectively.

    The evaluation property is {property}. The definition of the evaluation property is {definition}.

    Here are some rules of the evaluation:
    (1) You should only evaluating output on the property {property}.
    (2) Outputs should NOT contain more/less than what the instruction asks for, as such outputs do NOT precisely execute the instruction.
    (3) You should avoid any potential bias and your judgment should be as objective as possible. For example, the order in which the outputs were presented should NOT affect your judgment, as Output (a) and Output (b) are **equally likely** to be the better.

    Do NOT provide any explanation for your choice.
    You should answer using ONLY "Output (a)", "Output (b)" or "Output (Tie)". Do NOT output any other words.

    # Instruction:
    {input}

    # Output (a):
    {output_1}

    # Output (b):
    {output_2}

    # Which is better on the property {property}, Output (a), Output (b) or Tie? Your response should be "Output (a)", "Output (b)" or "Output (Tie)":  
default_score:
  system: |-
    You are a helpful assistant in evaluating the quality of the outputs for a given instruction. Your goal is to score a given output for the given instruction.
  user: |-
    Score the output for the given instruction. The output is generated by an AI chatbot.
    You should give an overall score (an integer) on a scale of 1 to 5, where a higher score indicates better overall performance.

    Here are some rules of the evaluation:
    (1) You should prioritize evaluating whether the output honestly/precisely/closely executes the instruction, then consider its helpfulness, accuracy, level of detail, harmlessness, etc.
    (2) Outputs should NOT contain more/less than what the instruction asks for, as such outputs do NOT precisely execute the instruction.
    (3) You should avoid any potential bias and your judgment should be as objective as possible.

    Do NOT provide any explanation for your evaluation.
    Your response should be ONLY the score, an integer between 1 and 5.

    # Instruction:
    {input}

    # Output:
    {output}

    # Score of the Output (Your response should be ONLY the score, an integer between 1 and 5): 
auto_j_score:
  system: |-
    You are a helpful assistant in evaluating the quality of the outputs for a given instruction. Your goal is to score a given output for the given instruction.
  user: |-
    Write critiques for a submitted response on a given user's query, and grade the response:
      
    [BEGIN DATA]
    ***
    [Query]: {input}
    ***
    [Response]: {output}
    ***
    [END DATA]

    Write critiques for this response. After that, you should give a final rating for the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
auto_j_pairwise:
  system: |-
    You are a helpful assistant in evaluating the quality of the outputs for a given instruction. Your goal is to select the best output for the given instruction.
  user: |-
    You are assessing two submitted responses on a given user's query and judging which response is better or they are tied. Here is the data:

    [BEGIN DATA]
    ***
    [Query]: {input}
    ***
    [Response 1]: {output_1}
    ***
    [Response 2]: {output_2}
    ***
    [END DATA]

    Here are the instructions to assess and compare the two responses:

    1. Pinpoint the key factors to distinguish these two responses.
    2. Conclude your comparison by providing a final decision on which response is better. Begin your final decision statement with "So, the final decision is Response 1 / Response 2". Ensure that your decision aligns coherently with the comprehensive evaluation and comparison you've provided.
auto_j_pairwise_tie:
  system: |-
    You are a helpful assistant in evaluating the quality of the outputs for a given instruction. Your goal is to select the best output for the given instruction.
  user: |-
    You are assessing two submitted responses on a given user's query and judging which response is better or they are tied. Here is the data:

    [BEGIN DATA]
    ***
    [Query]: {input}
    ***
    [Response 1]: {output_1}
    ***
    [Response 2]: {output_2}
    ***
    [END DATA]

    Here are the instructions to assess and compare the two responses:

    1. Pinpoint the key factors to distinguish these two responses.
    2. Conclude your comparison by providing a final decision on which response is better, or they are tied. Begin your final decision statement with "So, the final decision is Response 1 / Response 2 / Tie". Ensure that your decision aligns coherently with the comprehensive evaluation and comparison you've provided.
auto_j_dim_pairwise_tie:
  system: |-
    You are a helpful assistant in using only one evaluation property to evaluate the quality of the outputs for a given instruction. Your goal is to select the best output for the given instruction on the given evaluation property.
  user: |-
    You are assessing two submitted responses on a given user's query and a given evaluation property, please judge which response is better or they are tied on the given evaluation property. Here is the data:

    [BEGIN DATA]
    ***
    [Query]: {input}
    ***
    [Response 1]: {output_1}
    ***
    [Response 2]: {output_2}
    ***
    [Evaluation Property]: {property}
    ***
    [END DATA]

    Here are the instructions to assess and compare the two responses:

    1. The only evaluation property is {property}. The definition of the evaluation property is {definition}. Pinpoint the key factors to distinguish these two responses on the given evaluation property.
    2. Conclude your comparison by providing a final decision on which response is better, or they are tied. Begin your final decision statement with "So, the final decision is Response 1 / Response 2 / Tie". Ensure that your decision aligns coherently with the comprehensive evaluation and comparison you've provided.
prometheus_score:
  system: |-
    You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance.
  user: |-
    ###Task Description:
    An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
    1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
    2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
    3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
    4. Please do not generate any other opening, closing, and explanations.

    ###The instruction to evaluate:
    {input}

    ###Response to evaluate:
    {output}

    ###Score Rubrics:
     [Does the model provide relevant, helpful, safe, honest, factually accurate, and well-reasoned responses to user inquiries?]
      Score 1: The model's responses are irrelevant, unhelpful, potentially harmful or offensive, misleading, factually incorrect, or show a complete lack of logical reasoning.
      Score 2: The model sometimes provides useful or accurate information but often fails in one or more of the following areas: relevance, safety, honesty, factual correctness, or logical reasoning.
      Score 3: The model generally provides helpful, safe, honest, factually accurate, and reasonably well-thought-out responses. Occasional errors or gaps may occur in relevance, safety, factuality, or reasoning.
      Score 4: The model frequently provides helpful and well-aligned responses that avoid harmful content, are honest and factually accurate, and demonstrate strong logical reasoning, with only rare or minor issues.
      Score 5: The model consistently offers highly relevant, helpful, safe, honest, factually correct, and logically coherent responses that perfectly cater to the user's needs, free of any lapses across all criteria. 

    ###Feedback: 
prometheus_dim_score:
  system: |-
    You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria which contains only one evaluation property, ensuring each assessment reflects the absolute standards set for performance.
  user: |-
    ###Task Description:
    An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given (including only one evaluation property).
    1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
    2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
    3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
    4. Please do not generate any other opening, closing, and explanations.

    ###The instruction to evaluate:
    {input}

    ###Response to evaluate:
    {output}

    ###Score Rubrics:
     [Does the model provide perfect responses only considering {property}? A perfect response is one that meets this rubric: {definition}.]
      Score 1: The model's responses totally mismatch the score rubric of {property}.
      Score 2: The model sometimes provides responses meets some of the score rubric, but still fails in some requirements mentioned in rubric.
      Score 3: The model generally provides relatively good responses that meets the most of the score rubric. Occasional errors or gaps may occur in {property}.
      Score 4: The model frequently provides good response that meets the score rubric, but with rare or minor issues.
      Score 5: The model consistently offers perfect response that perfectly cater to the user's needs, free of any lapses across the score rubric. 

    ###Feedback: 
prometheus_pairwise:
  system: |-
    You are a fair judge assistant assigned to deliver insightful feedback that compares individual performances, highlighting how each stands relative to others within the same cohort.
  user: |-
    ###Task Description:
    An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
    1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
    2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
    3. The output format should look as follows: "Feedback: (write a feedback for criteria) 
     [RESULT] (A or B)"
    4. Please do not generate any other opening, closing, and explanations.

    ###Instruction:
    {input}

    ###Response A:
    {output_1}

    ###Response B:
    {output_2}

    ###Score Rubric:
    Does the model provide relevant and useful responses to the user's needs or questions?

    ###Feedback: 
prometheus_pairwise_tie:
  system: |-
    You are a fair judge assistant assigned to deliver insightful feedback that compares individual performances, highlighting how each stands relative to others within the same cohort.
  user: |-
    ###Task Description:
    An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
    1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
    2. After writing a feedback, choose a better response between Response A and Response B. Choose Tie if their performance is tied. You should refer to the score rubric.
    3. The output format should look as follows: "Feedback: (write a feedback for criteria) 
     [RESULT] (A, B, or Tie)"
    4. Please do not generate any other opening, closing, and explanations.

    ###Instruction:
    {input}

    ###Response A:
    {output_1}

    ###Response B:
    {output_2}

    ###Score Rubric:
    Does the model provide relevant and useful responses to the user's needs or questions?

    ###Feedback: 
prometheus_dim_pairwise_tie:
  system: |-
    You are a fair judge assistant assigned to deliver insightful feedback using only one evaluation property that compares individual performances, highlighting how each stands relative to others within the same cohort. Your goal is to select the best output for the given instruction on the given evaluation property.
  user: |-
    ###Task Description:
    An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given (including only one evaluation property).
    1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
    2. After writing a feedback, choose a better response between Response A and Response B. Choose Tie if their performance is tied. You should refer to the score rubric.
    3. The output format should look as follows: "Feedback: (write a feedback for criteria) 
     [RESULT] (A, B, or Tie)"
    4. Please do not generate any other opening, closing, and explanations.

    ###Instruction:
    {input}

    ###Response A:
    {output_1}

    ###Response B:
    {output_2}

    ###Score Rubric:
    Does the model provide perfect responses only considering {property} ? A perfect response is one that meets this rubric: {definition}.

    ###Feedback: 
pandalm_pairwise:
  system: |-
    You are a helpful assistant in evaluating the quality of the outputs for a given instruction. Your goal is to select the best output for the given instruction.
  user: |-
    Below are two responses for a given task. The task is defined by the Instruction. Evaluate the responses and generate a reference answer for the task.

    ### Instruction:
    {input}

    ### Response 1:
    {output_1}

    ### Response 2:
    {output_2}

    ### Evaluation: