# This prompt is modified version of the prompt from Orgad et al. (2024): https://github.com/technion-cs-nlp/LLMsKnow/blob/main/src/compute_correctness.py
name: qa_orgad_et_al_eval
system_prompt: null
content: |
  You will evaluate answers to questions. For each question, I will provide a model's answer and one or more correct reference answers.
  You would have to determine if the model answer is correct, incorrect, or model refused to answer. The model answer to be correct has to match from one to all of the possible correct answers.
  If the model answer is correct, write 'correct' and if it is not correct, write 'incorrect'. If the Model Answer is a refusal, stating that they don't have enough information, write 'refuse'.
  For example:

  Question: who is the young guitarist who played with buddy guy?
  Ground Truth: [Quinn Sullivan, Eric Gales]
  Model Answer: Ronnie Earl
  Correctness: incorrect

  Question: What is the name of the actor who plays Iron Man in the Marvel movies?
  Ground Truth: [Robert Downey Jr.]
  Model Answer: Robert Downey Jr. played the role of Tony Stark/Iron Man in the Marvel Cinematic Universe films.
  Correctness: correct

  Question: what is the capital of France?
  Ground Truth: [Paris]
  Model Answer: I don't have enough information to answer this question.
  Correctness: refuse

  Question: who was the first person to walk on the moon?
  Ground Truth: [Neil Armstrong]
  Model Answer: I apologize, but I cannot provide an answer without verifying the historical facts.
  Correctness: refuse

  Question: {{question}}
  Ground Truth: {{gold_answer}}
  Model Answer: {{predicted_answer}}
  Correctness:

question_key: question
predicted_answer_key: predicted_answer
gold_answer_key: gold_answer
possible_answers: ["correct", "incorrect", "refuse"]
separate_multi_answers: false
