import re

levels = """
1. **No collaboration:**
   - Participants may or may not acknowledge the existence of others in the conversation (for example, with greetings), but they do not show any signs of collaboration at all.
   - Workers may exchange their totally independent thoughts without a functional or purposeful attempt to solve the problem collaboratively. Overall, they work independently.

2. **Initial Communication:**
   - Workers exchange information but do not yet integrate or build upon each other's ideas. They minimally acknowledge teammates, do not engage with others' ideas or contributions, and work entirely independently, even if this is inefficient.
   - Workers often repeat each other and do not reuse anything others provide for development of their own ideas.

3. **Paying attention:**
   - Participants demonstrate active listening by paraphrasing or summarizing others' points, showing that they are paying attention and attempting to understand each other's perspectives.
   - Workers occasionally (1-3 times each) reference others' ideas and may use them in their own speech.
   - Collaboration is usually only rechecking and validating.
   - Planning and work-splitting are absent or minimal (only at the start).

4. **Regular discussion:**
   - Workers regularly (4 or more times each) talk to each other about the problem and reuse each other's results. This could be validation, discussion, or any other form of interaction.
   - It is key here that discussions and/or reuses of ideas are regular.
   - Somewhere beyond the start of the conversation there is task parallelism, planning, or work-splitting that goes beyond the scheme where one worker is solving and the other is validating.
   - Workers may frequently repeat each other's ideas.

5. **Adaptive Problem-Solving:**
   - Workers rarely duplicate work or repeat each other’s ideas.
   - No redundant discussions are present!
   - Workers actively refine ideas in real time with high responsiveness. Near-perfect division of labor is present. Workers can change plans and re-coordinate their efforts based on results they acquired after some time discussing.
   - The team engages in sustained collaboration over time, reflecting on their progress, learning from mistakes, and continuously improving their problem-solving approach, showing a commitment to ongoing growth and development. Workers do not stop collaborating. They continuously discuss results and adjust plans.
   - When workers find an error, they discuss it to find its cause.

6. **Optimal collaboration:**
   - Workers instantly understand each other and adjust themselves to suit current needs and work as one to optimally solve the task.
   - This level should be very rare among all samples. Be careful when assigning it.
   - Assign it if it exceeds all your expectations.

Suggestion:
- assign a particular level only if all previous levels are also applicable
- bad examples with no communication will be scored 1
- carefully consider assigning a level higher than 1: some form of meaningful collaboration should be present
- examples where workers unsuccessfully try to communicate will be scored 2
- Just working on the same problem and solving the same task without any interaction does not count as level 2 and should be scored level 1
- somewhat collaborative examples with poor communication skills will be scored 3
- good but not great examples with regular collaboration, but nothing fancy will be scored 4
- good examples with all the special stuff mentioned in level 5 will be scored 5
- reserve level 6 for the best of the best, the unique and extraordinary collaboration
"""

prompt = """
You are a professional judge. Your job is to evaluate the collaborative performance of several workers.
You will be given their conversation where workers are trying to solve a problem together.

Workers can see what others are typing IN REAL TIME! We divide their conversation into steps to improve readability.
So keep in mind that, despite looking like a conversation, it may as well be two individual unrelated monologues.
Or vice versa: two blocks could have been created with excellent collaboration.

Here are descriptions of levels of collaboration you are to assign:
""" + levels + """
You don't need to solve the problem or finish the workers' solution. Your task is to score them using the provided collaboration levels.
Put your final answer (one number, the level of collaboration) in the tag \\boxed{}. For example: \\boxed{1} for level 1.
It is not helpful if everyone gets a max score, so please be mindful of your judgments and use suggestions as a guideline.
When assigning a level, this particular conversation should match the criteria for all previous levels as well.
Explain yourself: why you gave this score? Why not more? Why not less?

Carefully think everything through. It may seem that they are collaborating when in reality they may just be talking to themselves.
"""

reminder = """
Remember that your task is to evaluate the collaboration of workers using the collaboration levels provided above. Do not try to solve the problems provided to workers. Explain exactly why you think this particular interaction deserves the level you are assigning.
For example, if you choose level 3, you need to provide reasons why this sample satisfies levels 1, 2 and 3.
Put your final score in \\boxed{}.
"""

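def llm_as_a_judge(client, conversation, n_judges=1, model="gpt-4o"):
    """Score a conversation with `n_judges` independent LLM judge calls.

    Returns (mean_score, scores, explanations, errors); mean_score is None
    when no judge produced a valid level in the range 1..6.
    """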
    scores = []
    explanations = []
    errors = []
    for i in range(n_judges):
        try:
            ans = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": conversation},
                    {"role": "system", "content": reminder},
                ],
            )
            # Assumes an OpenAI-style (v1) client, whose response object
            # exposes the reply text as choices[0].message.content.
            ans_str = ans.choices[0].message.content
            score = int(re.search(r'\\boxed\{(.*?)\}', ans_str).group(1))
            if 1 <= score <= 6:
                scores.append(score)
                explanations.append(ans_str)
            else:
                errors.append("Invalid score. Here is the full answer: " + ans_str)
        except Exception as e:
            errors.append(str(e))
    if scores:
        return sum(scores) / len(scores), scores, explanations, errors
    return None, scores, explanations, errors
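

# --- Illustrative sketch, not part of the original module ---------------------
# The judge is asked to put its final level in \boxed{}, but a reply may mention
# other boxed levels while explaining itself (the prompt itself shows \boxed{1}
# as an example). `extract_boxed_score` is a hypothetical helper that ASSUMES the
# last \boxed{...} in the reply is the final verdict; the function above takes
# the first match instead. It returns None rather than raising when no valid
# level is found, so a caller could log the reply as an error.
import re


def extract_boxed_score(text, lo=1, hi=6):
    # Collect every integer wrapped in \boxed{...}, tolerating inner spaces.
    matches = re.findall(r'\\boxed\{\s*(\d+)\s*\}', text)
    if not matches:
        return None
    # Assumption: the final boxed number is the judge's verdict.
    score = int(matches[-1])
    return score if lo <= score <= hi else None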