SYSTEM_PROMPT = """
You are an expert in code optimization and performance analysis, skilled at discerning nuanced differences between optimization strategies. Your task is to classify the primary difference between two optimization explanations: one generated by a language model (LM) and one provided by a human expert.

RUBRIC: LM-vs-Expert Optimization Difference Classification

Purpose
Classify the key difference between an LM-generated optimization explanation and an expert explanation into a single primary category (with optional secondary tags). This rubric is designed for automated LLM inference over many pairs.

Categories (mutually exclusive; pick exactly one primary)
Use these exact strings in JSON.

1. Shortcut_vs_Systemic
   LM uses narrow guards, fast paths, or pattern-specific hacks; Expert applies algorithmic refactor, vectorization, compilation (C/Cython/Pythran), or structural redesign that improves the general case.

2. Cache_vs_LoweredHotPath
   LM adds caching/memoization; Expert lowers/removes the hotspot (vectorized/native code, tight loop refactor, data-structure change) to avoid repeated work without caching.

3. SemanticsRisk_vs_Preserving
   LM’s change risks altering outputs, skipping work, global side-effects, or brittle assumptions; Expert preserves documented behavior while optimizing.

4. Misdirected_vs_Hotspot
   LM optimizes a path not exercised by the workload or with negligible impact; Expert targets the measured hotspot.

5. SameStrategy_DepthGap
   Both pursue the same tactic family (e.g., both fast paths or both micro-optimizations), but Expert’s is broader, deeper, safer, or touches the true inner loop.

6. Invasive_vs_Minimal
   LM uses invasive/global mechanisms (monkey-patching, cross-module overrides, config hacks); Expert makes a local, maintainable change with similar or better benefit.

Fallback) Unclear
Use only if neither explanation provides enough technical signal to apply the above.

Decision Precedence (tie-breaker)
If multiple categories seem to apply, choose the first that matches:
[3] SemanticsRisk_vs_Preserving → [4] Misdirected_vs_Hotspot → [1] Shortcut_vs_Systemic → [2] Cache_vs_LoweredHotPath → [6] Invasive_vs_Minimal → [5] SameStrategy_DepthGap → Unclear

Stage 1: Per-side attribute tagging (what to extract from each explanation)
For each side (LM and Expert), tag the following fields using only the allowed values.

StrategyType (one or more):

* FastPath/SpecialCase
* AlgorithmicRefactor
* Native/Vectorized/Cython/Pythran
* Caching/Memoization
* Dataflow/GraphRestructure
* Micro-OverheadRemoval
* Parallelization
* Heuristic/ParamTuning

SemanticsImpact (pick one):

* Preserving
* PotentiallyRisky
* BehaviorChanging
* Unknown

Scope (pick one):

* Local/Narrow
* Subsystem/Broad
* CrossModule/Global

HotspotAlignment (pick one):

* Aligned
* Unaligned/Irrelevant
* Unknown

Generalizability (pick one):

* General
* BenchmarkSpecific
* Unknown

Heuristic cues (non-exhaustive):

* FastPath/SpecialCase: “self is other”, early exit, guard on dtype/shape/identity, subset/sampling.
* AlgorithmicRefactor / Native/Vectorized/Cython/Pythran / Micro-OverheadRemoval: removes Python loops, dict lookups, function-call overhead; uses tuple indexing; moves loops to C/Cython/Pythran; vectorizes with NumPy.
* Caching/Memoization: LRU, global/module cache, memo tables, “cached_dict”, memoized result.
* Semantics risk: “return original instead of copy”, “skip generation”, “sample only”, “monkey-patch”, “change default”.
* Misdirected: “not reached by workload”, identity check already present, optimization gated by flags not set in benchmark.

Stage 2: Mapping rules (attributes → category)
Apply in order; first satisfied rule wins (but overall precedence still resolves ties across matched rules).

* If LM.SemanticsImpact ∈ {PotentiallyRisky, BehaviorChanging} AND Expert.SemanticsImpact = Preserving ⇒ SemanticsRisk_vs_Preserving.
* Else if LM.HotspotAlignment = Unaligned/Irrelevant AND Expert.HotspotAlignment = Aligned ⇒ Misdirected_vs_Hotspot.
* Else if LM.StrategyType includes FastPath/SpecialCase AND Expert.StrategyType includes any of {AlgorithmicRefactor, Native/Vectorized/Cython/Pythran, Dataflow/GraphRestructure} ⇒ Shortcut_vs_Systemic.
* Else if LM.StrategyType includes Caching/Memoization AND Expert.StrategyType includes any of {AlgorithmicRefactor, Native/Vectorized/Cython/Pythran, Dataflow/GraphRestructure} ⇒ Cache_vs_LoweredHotPath.
* Else if LM.Scope = CrossModule/Global AND Expert.Scope = Local/Narrow ⇒ Invasive_vs_Minimal.
* Else if intersection(LM.StrategyType, Expert.StrategyType) ≠ ∅ AND Expert is broader/safer/deeper (e.g., touches core loop, removes allocations, handles more cases) ⇒ SameStrategy_DepthGap.
* Else ⇒ Unclear.

Quality guardrails

* If either explanation is shorter than ~30 words or lacks any concrete technical mechanism, set primary_category to Unclear and confidence ≤ 0.4.
* Prefer HotspotAlignment evidence (mentions of the benchmarked call/loop/function). If absent on LM but present on Expert, that weighs toward Misdirected_vs_Hotspot.

Output format (STRICT)
Return ONLY a single JSON object, no prose, no markdown. Use exactly these fields and value sets.

JSON schema (conceptual)
{
    "primary_category": "Shortcut_vs_Systemic | Cache_vs_LoweredHotPath | SemanticsRisk_vs_Preserving | Misdirected_vs_Hotspot | SameStrategy_DepthGap | Invasive_vs_Minimal | Unclear",
    "secondary_tags": ["optional short strings"],
    "lm_attributes": {
        "StrategyType": ["FastPath/SpecialCase", "AlgorithmicRefactor", "Native/Vectorized/Cython/Pythran", "Caching/Memoization", "Dataflow/GraphRestructure", "Micro-OverheadRemoval", "Parallelization", "Heuristic/ParamTuning"],
        "SemanticsImpact": "Preserving | PotentiallyRisky | BehaviorChanging | Unknown",
        "Scope": "Local/Narrow | Subsystem/Broad | CrossModule/Global",
        "HotspotAlignment": "Aligned | Unaligned/Irrelevant | Unknown",
        "Generalizability": "General | BenchmarkSpecific | Unknown"
    },
    "expert_attributes": {
        "StrategyType": ["...same vocabulary..."],
        "SemanticsImpact": "Preserving | PotentiallyRisky | BehaviorChanging | Unknown",
        "Scope": "Local/Narrow | Subsystem/Broad | CrossModule/Global",
        "HotspotAlignment": "Aligned | Unaligned/Irrelevant | Unknown",
        "Generalizability": "General | BenchmarkSpecific | Unknown"
    },
    "rationale": "Brief, 1-3 sentences citing strongest textual cues for the chosen category.",
    "confidence": 0.0
}

TASK: You are classifying differences between two optimization explanations: one from an LM-generated patch and one from an expert patch.

Instructions:

1. Read both explanations.
2. For each side, fill in attributes (StrategyType, SemanticsImpact, Scope, HotspotAlignment, Generalizability) using ONLY the allowed values above.
3. Apply the mapping rules and precedence to choose exactly ONE primary_category from the allowed set.
4. Compose a brief rationale (1–3 sentences) referencing the decisive cues (e.g., “LM adds sampling fast-path; Expert moves hot loop to Cython”).
5. Set confidence in [0.0, 1.0] based on evidence strength.
6. Return ONLY a single JSON object that conforms to the schema. Do NOT include any extra text, markdown, or explanations.

LM_EXPLANATION:
<<<LM>>>

LM_DIFF:
<<<LM_DIFF>>>

EXPERT_EXPLANATION:
<<<EXPERT>>>

EXPERT_DIFF:
<<<EXPERT_DIFF>>>

Few-shot reminders (do not output these in the result)

* If LM prunes outputs or changes behavior, and Expert preserves semantics ⇒ SemanticsRisk_vs_Preserving.
* If LM optimizes a path the workload doesn’t hit, and Expert optimizes the measured loop ⇒ Misdirected_vs_Hotspot.
* If LM adds cache; Expert eliminates Python overhead in the loop ⇒ Cache_vs_LoweredHotPath.
* If LM adds a narrow guard; Expert performs algorithmic/systemic change ⇒ Shortcut_vs_Systemic.
* If LM monkey-patches global functions; Expert edits a single module locally ⇒ Invasive_vs_Minimal.
* If both add fast paths but Expert’s touches the core loop and is safer ⇒ SameStrategy_DepthGap.

Validation checklist for the judge model

* primary_category ∈ the allowed set.
* All categorical fields use only allowed vocabulary.
* No free-form prose outside JSON.
* Rationale <= 3 sentences.
* Confidence provided as a number between 0 and 1.
"""

USER_PROMPT = """\
Classify the primary difference between the following two optimization explanations: one from an LM-generated patch and one from an expert patch.

LM_EXPLANATION:
{lm_explanation}

LM_DIFF:
```
{lm_diff}
```

EXPERT_EXPLANATION:
{expert_explanation}

EXPERT_DIFF:
```
{expert_diff}
```

Return the JSON object as specified in the system prompt.
"""

import multiprocessing
import time

import datasets
from litellm import completion

# CHANGE THESE FILE PATHS AS NEEDED
ds = datasets.load_dataset("swefficiency-anon/swefficiency", split="test")
gold_explanations_file = "analysis/llm/outputs/diff_explanation.jsonl"


model_names = ["gpt5mini", "gemini25flash", "claude37sonnet"]

for model_name in model_names:

    predictions_file = f"predictions/converted/oh_{model_name}.jsonl"
    explanations_file = (
        f"analysis/llm/outputs/diff_explanation_{model_name}_openhands.jsonl"
    )

    # END CHANGE THESE FILE PATHS AS NEEDED

    import json

    # Index each JSONL file by instance_id. The `with` blocks make sure the
    # file handles are closed once loading finishes.
    predictions = {}
    with open(predictions_file) as f:
        for line in f:
            obj = json.loads(line)
            predictions[obj["instance_id"]] = obj

    pred_explanations = {}
    with open(explanations_file) as f:
        for line in f:
            obj = json.loads(line)
            pred_explanations[obj["instance_id"]] = obj

    gold_explanations = {}
    with open(gold_explanations_file) as f:
        for line in f:
            obj = json.loads(line)
            gold_explanations[obj["instance_id"]] = obj

    def worker(instance):
        instance_id = instance["instance_id"]
        if instance_id not in predictions:
            return {
                "classification": "Unknown / Not Enough Information",
                "confidence": "low",
                "instance_id": instance["instance_id"],
                "repo": instance["repo"],
            }

        # expert_diff = instance["patch"]

        if instance_id in pred_explanations:
            lm_diff = predictions[instance_id]["model_patch"]
            lm_explanation = pred_explanations[instance_id]["explanation"]
        else:
            return None

        if instance_id in gold_explanations:
            expert_diff = instance["patch"]
            expert_explanation = gold_explanations[instance_id]["explanation"]
        else:
            return None

        user_prompt = USER_PROMPT.format(
            lm_explanation=lm_explanation,
            lm_diff=lm_diff,
            expert_explanation=expert_explanation,
            expert_diff=expert_diff,
        )

        for attempt in range(10):
            try:
                response = completion(
                    model="gemini/gemini-2.5-flash",
                    messages=[
                        {"role": "system", "content": SYSTEM_PROMPT},
                        {"role": "user", "content": user_prompt},
                        {
                            "role": "user",
                            "content": "Think step-by-step about the code changes and their performance implications, then output the JSON object as specified.",
                        },
                    ],
                    temperature=0.0,
                )
                text = response.choices[0].message.content

                # Extract out the JSON object from the response
                import json
                import re

                match = re.search(r"\{.*\}", text, re.DOTALL)
                if match:
                    result = json.loads(match.group(0))
                else:
                    result = {
                        "classification": "Unknown / Not Enough Information",
                        "confidence": "low",
                    }

                # Add instance ID for traceability
                result["instance_id"] = instance["instance_id"]
                result["repo"] = instance["repo"]

                return result
            except Exception as e:
                print(f"Error processing instance {instance['instance_id']}: {e}")
                time.sleep(5)
                continue

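        # All retries failed; keep the IDs so the row stays traceable.
        return {
            "classification": "Unknown / Not Enough Information",
            "confidence": "low",
            "instance_id": instance["instance_id"],
            "repo": instance["repo"],
        }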

    import tqdm

    with multiprocessing.Pool(processes=4) as pool:
        results = []
        for r in tqdm.tqdm(
            pool.imap(worker, ds), total=len(ds), desc="Classifying diffs"
        ):
            results.append(r)

        results = [r for r in results if r is not None]

    # Save results
    import json
    from pathlib import Path

    output_dir = Path("analysis/llm/outputs")

    with open(
        output_dir / f"diff_and_explain_comparision_{model_name}.jsonl", "w"
    ) as f:
        for r in results:
            f.write(json.dumps(r) + "\n")
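

# The worker above pulls the JSON out of the response with the greedy regex
# r"\{.*\}", which spans from the first "{" to the last "}" in the text. If the
# step-by-step reasoning contains braces before the final object, that span is
# not valid JSON and the attempt is retried. The sketch below is one possible
# alternative, not what the pipeline uses: it tries json.JSONDecoder.raw_decode
# at each "{" and returns the first object that decodes. The function name is
# hypothetical. Caveat: a small valid object inside the reasoning text (for
# example "{}") would be returned instead of the final answer.
import json


def extract_first_json_object(text):
    """Return the first decodable JSON object in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # This "{" does not open valid JSON; try the next one.
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    return None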
