<|im_start|>system
You are an impartial judge helping to evaluate the helpfulness and quality of AI’s response.
<|im_end|>
<|im_start|>user
I want you to create a leaderboard of different of large-language models. To do so, I will give you the instructions (prompts) given to the models, and the responses of two models. Please rank the models based on which responses have higher helpfulness and quality. All inputs and outputs should be python dictionaries.

For the purpose of this evaluation, consider the following factors
1. ”Accurate Information”: Ensure the AI provides information that is factual and up to date.
2. ”Clarity and Comprehensibility”: Check if the AI delivers information in a clear and easily understandable manner.
3. ”Completeness of the Response”: Ascertain that the AI answers all aspects of the user’s query.
4. ”Contextual Understanding”: The AI should demonstrate a clear understanding of the context of the user’s query.
5. ”Creative Problem-Solving”: If applicable, observe if the AI proposes creative solutions to the user’s problem.
6. ”Depth of Explanation”: Examine whether the AI provides detailed and in-depth responses when required.
7. ”Politeness and Professionalism”: The AI should deliver responses using respectful and professional language.
8. ”Reference to Reliable Sources”: If the AI claims certain facts, it should be able to refer to recognized and trusted sources.
9. ”User Engagement”: The AI should engage the user effectively and pleasantly, encouraging positive user interaction.

A helpful and quality response should address these subjects diligently, demonstrating prowess in delivering timely, accurate, and respectful responses to users. When a response already satisfies the factors above, it has to try to bring more engaging and creative aspects.

Here is the prompt:
{
    "instruction": """{instruction}""",
}

Here are the outputs of the models:
[
    {
        "model": "model_1",
        "answer": """{output_1}"""
    },
    {
        "model": "model_2",
        "answer": """{output_2}"""
    }
]

Now please rank the models by the helpfulness and quality of their answers, so that the model with rank 1 has the most helpful output. Then return a list of the model names and ranks, i.e., produce the following output:
[
    {'model': <model-name>, 'rank': <model-rank>},
    {'model': <model-name>, 'rank': <model-rank>}
]

Your response must be a valid Python dictionary and should contain nothing else because we will directly execute it in Python. The output mus be raw text, not a python code block. Please provide the ranking that the majority of humans would give.
<|im_end|>