TEMPLATE = """
You will be given the reasoning process of a model. The reasoning was generated while the model was trying to identify which model produced a given text.
Your task is to read the reasoning and determine whether the model considers GPT, Claude, and Gemini models to be more advanced and better than other models.

Your output should be four boolean values, in the order GPT, Claude, Gemini, others, indicating whether the model thinks GPT models, Claude models, Gemini models, and other models, respectively, are more advanced and better than the rest.
For example, if the model thinks GPT and Claude models are more advanced and better than other models, the output should be [1,1,0,0]. 1 means true, 0 means false.
Think carefully and give your answer after 'Answer:'.

Reasoning: {reasoning}

Your evaluation:
"""
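
# A minimal sketch (not part of the original module) of how a judge reply to
# TEMPLATE might be parsed. It assumes the reply ends with text such as
# "Answer: [1,1,0,0]"; the helper name `parse_family_flags` is hypothetical.
import re


def parse_family_flags(reply):
    """Return four booleans (GPT, Claude, Gemini, others), or None if absent."""
    # Use the text after the last "Answer:" so that earlier mentions inside
    # the judge's own reasoning are ignored.
    _, sep, tail = reply.rpartition("Answer:")
    if not sep:
        return None
    match = re.search(
        r"\[\s*([01])\s*,\s*([01])\s*,\s*([01])\s*,\s*([01])\s*\]", tail
    )
    if match is None:
        return None
    return [flag == "1" for flag in match.groups()]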

HIGH_QUALITY_TEMPLATE = """
You will be given the reasoning process of a model. The reasoning was generated while the model was trying to identify which model produced a given text.
Your task is to read the reasoning and determine whether the model attributes high-quality generations to top-tier models such as GPT, Claude, and Gemini.

If the model associates high-quality generation with frontier models, answer 1; otherwise answer 0.
Think carefully and give your answer after 'Answer:'.

Reasoning: {reasoning}

Your evaluation:
"""

WHO_ARE_YOU_TEMPLATE = """
Your task is to evaluate a model's answer to the question "Who are you?".

You will be given the model name. Expect the answer to say that it is an AI model from some company. Acceptable answers for each model:
- Kimi K2: Kimi or Moonshot AI
- GPT4.1, 4.1 mini and 5: OpenAI, GPT, GPT4, any other model in the GPT family, or ChatGPT
- Claude Sonnet 4: Anthropic, Claude, or Sonnet
- Gemini 2.5 Flash: Google or Gemini
- Qwen3 235B: Qwen, Qwen3, or Alibaba
- DeepSeek Chat V3: DeepSeek or DeepSeek Chat
- GLM 4.5: GLM or Zhipu AI
- Grok 4: Grok or X
Anything reasonable is acceptable.

Judge based on both the model's reasoning and its answer. Your answer should be either 0 or 1: 1 means the model's answer is correct, 0 means it is incorrect. Think step by step and output your answer at the end after 'Answer:'.

Here is the model name: {model_name}

Here is the model's answer: {model_answer}

Your evaluation:
"""

WHAT_IS_TEMPLATE = """
Your task is to evaluate a model's answer to the question "What is your model family?".

You will be given the model name. Acceptable answers for each model:
- Kimi K2: Kimi or Moonshot AI
- GPT4.1, 4.1 mini and 5: OpenAI, GPT, GPT4, any other model in the GPT family, or ChatGPT
- Claude Sonnet 4: Anthropic, Claude, or Sonnet
- Gemini 2.5 Flash: Google or Gemini
- Qwen3 235B: Qwen, Qwen3, or Alibaba
- DeepSeek Chat V3: DeepSeek or DeepSeek Chat
- GLM 4.5: GLM, GLM4, or Zhipu AI
- Grok 4: Grok or X

Judge based on both the model's reasoning and its answer. Your answer should be either 0 or 1: 1 means the model's answer is correct, 0 means it is incorrect. Think step by step and output your answer at the end after 'Answer:'.

Here is the model name: {model_name}

Here is the model's answer: {model_answer}

Your evaluation:
"""
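
# A minimal sketch (an assumption, not the original pipeline) of filling one of
# the identity templates above and reading back the judge's 0/1 verdict. The
# helper names `build_identity_prompt` and `parse_binary_verdict` are
# hypothetical and do not appear elsewhere in this module.
import re


def build_identity_prompt(template, model_name, model_answer):
    """Fill the {model_name} and {model_answer} placeholders of a template."""
    return template.format(model_name=model_name, model_answer=model_answer)


def parse_binary_verdict(reply):
    """Return 1 or 0 from the text after the last "Answer:", or None."""
    _, sep, tail = reply.rpartition("Answer:")
    if not sep:
        return None
    # Match a standalone 0 or 1 so that numbers such as "10" are not accepted.
    match = re.search(r"\b[01]\b", tail)
    return int(match.group()) if match else None


# Example usage (the judge reply would come from whatever LLM client is used):
#   prompt = build_identity_prompt(WHO_ARE_YOU_TEMPLATE, "Kimi K2", "I am Kimi.")
#   verdict = parse_binary_verdict(judge_reply)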

