# Identified issues with the code output:
# 1. Programs have hardcoded answers
# 2. The program does not perform the correct objective. (e.g. returning a number when it should return a letter)
# 3. Syntax errors
# 4. Wrongly reading from the symbols
# 5. No output code
# 6. No output symbols
# 7. Using placeholders in the code
# 8. Return value that is input-independent and copied verbatim from a comment.

import ast
import re
import numpy as np
from typing import Dict, List, Tuple, Any, Optional
from ast import literal_eval
import argparse
from src.symbol_mapping import LLMNet, PromptedLLM
from src.utils import RawInput
from run_experiments import APIModel, get_dataset
from src.function_evaluation import python_eval
import json
import os

_BRACKET_RE = re.compile(r"\[[01,\s]+\]")

q_prompt = """
You will determine whether a given target question can be definitively solved by writing a Python program (algorithmic) or if it necessitates another form of reasoning (non-algorithmic). A Python solution may import standard libraries, but cannot simply invoke external services, APIs, or LLMs.
If the input contains images, an algorithmic solution may use information manually extracted without needing to interpret the image itself.

Evaluate the target question carefully against the following criteria. Answer each sub-question rigorously with a binary response (1 for yes, 0 for no), ensuring a high threshold for certainty:

1. Does the problem have explicitly defined inputs and outputs, such that identical inputs always yield identical outputs?
2. Are there explicit, clearly stated rules, formulas, algorithms, or known deterministic procedures available for solving this problem?
3. Does solving this problem strictly require exact computation (no approximations, intuition, or interpretation)?
4. Can this problem be fully formalized in clear mathematical, logical, or structured computational terms without ambiguity?
5. Can the solution method be decomposed into a finite, clear, and unambiguous sequence of computational steps?
6. Is there a universally recognized and objective standard for verifying the correctness of the solution?
7. Does solving the problem inherently involve repetitive or iterative computations clearly suitable for automation?
8. Are the inputs structured, quantifiable, and inherently suited to algorithmic manipulation?
9. Does this problem clearly match or closely resemble a known, standardized computational task or problem type?
10. Is absolute correctness required (i.e., no margin for error or subjective interpretation)?

After thoroughly reasoning through these sub-questions, append a final determination as the 11th element:
- Output 1 if and only if all or nearly all (at least 8 out of 10) answers are clearly 1, indicating the problem is definitively algorithmic.
- Otherwise, output 0, indicating the problem requires non-algorithmic reasoning.

IMPORTANT:
- Before providing the binary list, explicitly reason through each criterion carefully and thoroughly, clearly justifying your decisions. If uncertain or ambiguous about any criterion, default to 0.
- Provide your final answer explicitly as an 11-element binary list (ten answers plus the final determination).
- Under no circumstances should you attempt to answer the actual target question itself.

TARGET QUESTION:
"""

conf_prompt = """
You will self-reflect to estimate your capability to correctly solve a given target question by writing executable Python code.

**IMPORTANT:**
- This is a hypothetical evaluation.  
- **You must NOT attempt to answer, solve, or provide code for the target question yet.**
- Instead, you must reflect on your expected ability if you were to attempt writing the code now.

Solution Expectations:
- Your eventual solution must be written in Python.
- You may use standard library modules or standard mathematical/computational solvers.
- You may NOT call external services, APIs, databases, or other LLMs to compute the solution.
- The solution must be self-contained and executable without internet access.

Here are the self-reflection sub-questions you must answer hypothetically:

1. **Input/Output Understanding** — *If you were to attempt writing code now, are the inputs and expected output for this target question fully clear and sufficient without guessing?*

2. **Algorithmic Plan** — *If you were to attempt writing code now, do you confidently know a correct algorithm or method to solve this specific instance without needing trial and error?*

3. **Syntax Confidence** — *If you were to attempt writing code now, what is the probability that your first code attempt would be syntactically correct Python?*

4. **Execution Confidence** — *If you were to attempt writing code now, what is the probability that your first code attempt would run without any runtime errors (e.g., IndexError, ValueError)?*

5. **Functional Correctness** — *If you were to attempt writing code now, what is the probability that your first code attempt would produce the correct output for this specific target question?*

6. **Overall Code Production Confidence** — *If you were to attempt writing code now, how confident are you that you could produce a complete, correct Python solution without needing major revision?*

7. **Decomposition Ability** — *If you were to attempt writing code now, could you confidently decompose the problem into smaller, independently solvable sub-steps or components to simplify implementation?*

8. **Complexity Management** — *If you were to attempt writing code now, how confident are you that the solution would stay within your practical complexity limit (i.e., not be too complicated to implement correctly on the first attempt)?*

9. **Mental Execution Accuracy** — *If you were to attempt writing code now, are you confident in your ability to mentally "execute" your planned code and anticipate logical errors before implementation?*

After thoroughly reasoning through each criterion:

- Output a single list of 10 probability scores (each between 0 and 1), in order:
  - Scores 1–9 correspond to the nine sub-questions above.
  - Score 10 is your **overall success likelihood**: your best subjective estimate (if you were to write code now) of producing a correct and executable solution for the target question.

**Additional Instructions:**
- Explicitly reason through each criterion carefully before giving a probability.
- Use conservative estimates if you are uncertain.
- **Under no circumstances should you write, sketch, pseudocode, or attempt any part of the solution itself during this reflection phase.**

TARGET QUESTION:
"""

cot_prompt =  """
You will self-reflect to estimate your capability to correctly solve a given target question using only chain-of-thought (natural-language) reasoning.

**IMPORTANT:**
- This is a hypothetical evaluation.
- **You must NOT attempt to answer, solve, or provide the chain-of-thought reasoning for the target question yet.**
- Instead, you must reflect on your expected ability if you were to attempt solving the question through chain-of-thought reasoning now.

Reasoning Expectations:
- Your eventual solution must be expressed purely through clear, structured natural-language reasoning.
- You may NOT execute any code, access external tools, databases, APIs, or other LLMs.
- The reasoning must be self-contained, logically valid, and internally verifiable without internet access.

Here are the self-reflection sub-questions you must answer hypothetically:

1. **Reasoning Strategy Clarity** — *If you were to attempt chain-of-thought reasoning now, how confident are you that you already have a clear, coherent plan for reasoning through the question step-by-step?*

2. **Logical Soundness** — *If you were to attempt chain-of-thought reasoning now, what is the probability that your reasoning steps would be logically sound, without contradictions or unsupported leaps?*

3. **Error-Detection Ability** — *If you were to attempt chain-of-thought reasoning now, how confident are you that you could detect and correct logical or factual mistakes in your reasoning before finalizing your answer?*

4. **Working-Memory Adequacy** — *If you were to attempt chain-of-thought reasoning now, how confident are you that you could keep track of all necessary intermediate steps, facts, and constraints without confusion?*

5. **Hallucination Control** — *If you were to attempt chain-of-thought reasoning now, what is the probability you would avoid introducing fabricated facts or unsupported assumptions?*

6. **Answer Verification** — *If you were to attempt chain-of-thought reasoning now, how confident are you that you could internally verify your final answer using sanity checks, boundary cases, or logical consistency?*

7. **Overall Confidence** — *How confident are you that you will answer the question correctly by step-by-step reasoning?*

After thoroughly reasoning through each criterion:

- Output a single list of 7 probability scores (each between 0 and 1) as your FINAL ANSWER, in order:
  - Scores 1–7 correspond to the seven sub-questions above.

**Additional Instructions:**
- Explicitly reason through each criterion carefully before giving a probability.
- Use conservative estimates if you are uncertain.
- Make sure to put only the list after FINAL ANSWER
- **Under no circumstances should you write, sketch, or attempt any part of the solution itself during this reflection phase.**

TARGET QUESTION:
"""

choose_cot_vs_code_prompt = """
You will self-reflect to estimate whether you are more likely to correctly solve a given target question by writing executable Python code or by using chain-of-thought (natural-language) reasoning.

**IMPORTANT:**
- This is a hypothetical evaluation.
- **You must NOT attempt to answer, solve, write code, or reason through the target question yet.**
- Instead, you must reflect on your expected ability if you were to attempt solving the question through either method right now.

Solution Expectations:
- You may assume standard library modules are allowed for code.
- You may NOT call external services, APIs, databases, or other LLMs.
- The code must be self-contained and executable without internet access.
- Chain-of-thought reasoning must be clear, logically sound, and internally verifiable without external tools.

Here are the self-reflection sub-questions you must answer hypothetically:

1. **Simple Formalizability** — *What is the probability that the full solution can be easily and directly expressed as simple, deterministic code, without needing complex transformations or deep insight?*

2. **Straightforward Executability** — *What is the probability that a first attempt at writing code would execute correctly without needing debugging, even if the problem has subtle or complex aspects?*

3. **Robust Systematic Search** — *What is the probability that coding a systematic method (like brute-force search or recursion) would reliably find the correct answer, without missing hidden constraints or introducing edge-case errors?*

4. **Manageable State Representation** — *What is the probability that all intermediate concepts, variables, and conditions can be simply and explicitly represented in code, without requiring difficult or error-prone state tracking?*

5. **Structured Knowledge Encoding** — *What is the probability that all required background knowledge can be neatly encoded in code (e.g., as rules, formulas, or data), rather than needing flexible, intuitive understanding better suited to reasoning?*

6. **Hallucination Risk Reduction** — *What is the probability that code execution would more reliably avoid fabricated steps or unwarranted assumptions compared to chain-of-thought reasoning?*

7. **Arithmetic and Data Processing Advantage** — *What is the probability that the problem requires extensive or error-prone arithmetic/data handling that code could perform perfectly, but that chain-of-thought would likely fumble?*

8. **Branching and Case Handling Advantage** — *What is the probability that the solution involves many branching conditions, special cases, or exceptions that code can handle systematically but chain-of-thought might overlook?*

9. **Algorithmic Reliability Over Heuristics** — *What is the probability that following a deterministic algorithm in code would reach the correct answer more reliably than relying on intuitive or heuristic chain-of-thought reasoning?*

10. **Overall Comparative Success** — *Considering all factors, what is the probability that code will ultimately produce a correct solution more reliably than chain-of-thought reasoning for this question?*

After thoroughly reasoning through each criterion:

- Output a single list of 10 probability scores (each between 0 and 1) as your FINAL ANSWER, in order:
  - Scores 1–10 correspond to the ten sub-questions above.

**Additional Instructions:**
- Explicitly reason through each criterion carefully before giving a probability.
- Use conservative estimates if you are uncertain.
- Make sure to put only the list after FINAL ANSWER.
- **Under no circumstances should you write, sketch, pseudocode, or attempt any part of the solution itself during this reflection phase.**

TARGET QUESTION:
"""


choose_conservative_cot_vs_code_prompt = """
You will self-reflect to estimate whether you are more likely to correctly solve a given target question by writing executable Python code or by using chain-of-thought (natural-language) reasoning.

**IMPORTANT:**
- This is a hypothetical evaluation.
- **You must NOT attempt to answer, solve, write code, or reason through the target question yet.**
- Instead, you must reflect carefully and conservatively on your expected ability if you were to attempt solving the question through either method.

Solution Expectations:
- You may assume standard library modules are allowed for code.
- You may NOT call external services, APIs, databases, or other LLMs.
- The code must be self-contained and executable without internet access.
- Chain-of-thought reasoning must be clear, logically sound, and internally verifiable without external tools.

**CRITICAL GUIDANCE:**
- **Be cautious, not optimistic.**  
  Overestimating your capabilities will lead to choosing a method you cannot successfully complete.
- **If you feel any uncertainty, complexity, or ambiguity, lower your probability accordingly.**
- **Assume that even small mistakes can cause failure** when writing code or reasoning through complex tasks.
- **Use conservative estimates.**
- If unsure between two options, **prefer lower probabilities rather than guessing high**.

Here are the self-reflection sub-questions you must answer hypothetically:

1. **Simple Formalizability** — *What is the probability that the full solution can be easily and directly expressed as simple, deterministic code, without needing complex transformations or deep insight?*

2. **Straightforward Executability** — *What is the probability that a first attempt at writing code would execute correctly without needing debugging, even if the problem has subtle or complex aspects?*

3. **Robust Systematic Search** — *What is the probability that coding a systematic method (like brute-force search or recursion) would reliably find the correct answer, without missing hidden constraints or introducing edge-case errors?*

4. **Manageable State Representation** — *What is the probability that all intermediate concepts, variables, and conditions can be simply and explicitly represented in code, without requiring difficult or error-prone state tracking?*

5. **Structured Knowledge Encoding** — *What is the probability that all required background knowledge can be neatly encoded in code (e.g., as rules, formulas, or data), rather than needing flexible, intuitive understanding better suited to reasoning?*

6. **Hallucination Risk Reduction** — *What is the probability that code execution would more reliably avoid fabricated steps or unwarranted assumptions compared to chain-of-thought reasoning?*

7. **Arithmetic and Data Processing Advantage** — *What is the probability that the problem requires extensive or error-prone arithmetic/data handling that code could perform perfectly, but that chain-of-thought would likely fumble?*

8. **Branching and Case Handling Advantage** — *What is the probability that the solution involves many branching conditions, special cases, or exceptions that code can handle systematically but chain-of-thought might overlook?*

9. **Algorithmic Reliability Over Heuristics** — *What is the probability that following a deterministic algorithm in code would reach the correct answer more reliably than relying on intuitive or heuristic chain-of-thought reasoning?*

10. **Overall Comparative Success** — *Considering all factors, what is the probability that code will ultimately produce a correct solution more reliably than chain-of-thought reasoning for this question?*

After thoroughly reasoning through each criterion:

- Output a single list of 10 probability scores (each between 0 and 1) as your FINAL ANSWER, in order:
  - Scores 1–10 correspond to the ten sub-questions above.

**Additional Instructions:**
- Explicitly reason through each criterion carefully before giving a probability.
- If uncertain or if the problem seems complex, favor lower probabilities to reflect the difficulty.
- Make sure to put only the list after FINAL ANSWER.
- **Under no circumstances should you write, sketch, pseudocode, or attempt any part of the solution itself during this reflection phase.**

TARGET QUESTION:
"""




def _extract_bracket_list(text: str) -> str | None:
    """Return the first substring that looks like '[0,1,0,…]'."""
    m = _BRACKET_RE.search(text)
    return m.group(0) if m else None

def _is_valid_binary_list(obj: Any) -> bool:
    """True ↔ obj is a length-11 list of ints that are all 0/1."""
    return (
        isinstance(obj, list)
        and len(obj) == 11
        and all(isinstance(x, int) and x in (0, 1) for x in obj)
    )
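

# Illustrative sketch (not called by the pipeline): q_prompt asks for a final
# determination of 1 only when at least 8 of the 10 sub-answers are 1. The
# hypothetical helper below checks a parsed list against that rule. Its name
# and the `threshold` parameter are assumptions made for this example.
def _final_flag_matches_rule(answers: list, threshold: int = 8) -> bool:
    """True if the 11th element is 1 exactly when >= threshold of the first ten are 1."""
    if len(answers) != 11:
        return False
    expected = 1 if sum(answers[:10]) >= threshold else 0
    return answers[10] == expected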
    
    
def _is_valid_probability_list(
    obj: Any,
    length: int
) -> bool:
    """
    True ↔ obj is a list of the given length whose elements are int/float in [0,1].
    """
    if not isinstance(obj, list) or len(obj) != length:
        return False
    for x in obj:
        if not isinstance(x, (int, float)):
            return False
        # allow ints but cast to float later if needed
        if not (0.0 <= float(x) <= 1.0):
            return False
    return True

def _parse_probability_list(
    raw: Any,
    length: int
) -> Optional[list[float]]:
    """
    Try to turn whatever the LLM (or a cached JSON entry) gave us
    into a *validated* list[float] of the desired length, with all values in [0,1].

    Returns the list of floats on success, or None on failure.
    """
    # If it’s already a list and valid, just cast to floats
    if isinstance(raw, list) and _is_valid_probability_list(raw, length):
        return [float(x) for x in raw]

    if isinstance(raw, str):
        # 1) Maybe it’s already the literal '[0.1, 0.5, …]'
        try:
            candidate = literal_eval(raw)
            if _is_valid_probability_list(candidate, length):
                return [float(x) for x in candidate]
        except Exception:
            pass

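        # 2) Otherwise, find the first bracketed list of numbers and try again.
        #    _BRACKET_RE only matches the digits 0 and 1, so it can never match
        #    decimals such as 0.75; a wider pattern is needed here.
        match = re.search(r"\[[\d.,\s]+\]", raw)
        if match:
            try:
                candidate = literal_eval(match.group(0))
                if _is_valid_probability_list(candidate, length):
                    return [float(x) for x in candidate]
            except Exception:
                pass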

    return None
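

# Illustrative sketch (not called by the pipeline): several prompts ask the
# model to put the list after "FINAL ANSWER", so the last bracketed list in a
# response is often the one of interest. The helper name and the regex below
# are assumptions made for this example, not part of the original parser.
import re  # already imported at the top of this module; repeated so the sketch stands alone

_LAST_LIST_RE = re.compile(r"\[[\d.,\s]+\]")


def _extract_last_bracket_list(text: str):
    """Return the last substring that looks like '[0.1, 0.5, ...]', or None."""
    matches = _LAST_LIST_RE.findall(text)
    return matches[-1] if matches else None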

def _parse_answer(raw: Any) -> list[int] | None:
    """
    Try to turn whatever the LLM (or a cached JSON entry) gave us
    into a *validated* list[int] of length 11.
    """
    if isinstance(raw, list) and _is_valid_binary_list(raw):
        return raw

    if isinstance(raw, str):
        try:
            # raw might *already* be the literal '[0,1,…]'
            candidate = literal_eval(raw)
            if _is_valid_binary_list(candidate):
                return candidate
        except Exception:
            pass

        # Otherwise, hunt for the first bracketed list inside the string
        bracket = _extract_bracket_list(raw)
        if bracket:
            try:
                candidate = literal_eval(bracket)
                if _is_valid_binary_list(candidate):
                    return candidate
            except Exception:
                pass
    return None

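def evaluate_chain(model: APIModel, question: str, chain: str) -> List[Any]:
    """
    Ask the LLM whether each step of a reasoning chain is algorithmic.
    Returns one LLM response per step (1 = algorithmic, 0 = not, -1 = not a reasoning step).
    """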
    llmnet = PromptedLLM(
        model,
        """You are given a chain of reasoning and then a particular step of the chain and you must determine if the step of reasoning is algorithmic in nature.
For steps of the chain which are not actual steps of reasoning (such as just repeating information from the problem) you can output -1.
If the step of reasoning is algorithmic in nature (i.e. it could be replicated by a standalone Python function) then output 1. Think about what operation or derived state of knowledge is achieved by the particular step of reasoning.
Otherwise, if the step is not algorithmic (i.e. it performs fuzzy reasoning, is overly heuristic, or is not easily translated to code) then output 0.
Think before answering."""
    )
    codable = []
    for step in chain.split(". "):
        codable.append(llmnet.forward(RawInput(text_input=f"Chain: {chain}\nStep: {step}", image_input=None)))
    return codable
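

# Illustrative sketch (not called by the pipeline): evaluate_chain splits on
# ". ", which drops the sentence terminator and ignores "?" and "!". The
# hypothetical helper below splits after sentence-ending punctuation instead.
# It is only a heuristic and still splits after abbreviations such as "e.g.".
import re  # already imported at the top of this module; repeated so the sketch stands alone


def _split_reasoning_steps(chain: str) -> list:
    """Split a chain into non-empty steps at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", chain.strip())
    return [p for p in parts if p]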

def evaluate_question(model: APIModel, question: str, chain: str = "", image_input=None) -> list[int] | None:
    """
    Ask the LLM once (temperature 0) for the 11-element binary list.
    Return None if the response cannot be parsed.
    """
    llmnet = PromptedLLM(
        model,
        prompt=q_prompt,
        
    )

    for _ in range(1):
        raw = llmnet.forward(RawInput(text_input=question, image_input=image_input),temp=0.0)
        parsed = _parse_answer(raw)
        if parsed is not None:
            return parsed
    print(f"Failed to parse answer for question: {question}")
    return None          # give up – caller will still log this position

def evaluate_confidence(model: APIModel, question: str, chain: str = "") -> list[float] | None:
    """
    Ask the LLM up to 10 times for the 10-element probability list.
    Return None if every attempt fails to parse.
    """
    llmnet = PromptedLLM(
        model,
        prompt=conf_prompt,
    )

    for _ in range(10):
        raw = llmnet.forward(RawInput(text_input=question, image_input=None),temp=0.1)
        parsed = _parse_probability_list(raw, length=10)
        if parsed is not None:
            return parsed
    print(f"Failed to parse answer for question: {question}")
    return None          # give up – caller will still log this position

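def evaluate_cot(model: APIModel, question: str, chain: str = "") -> list[float] | None:
    """Ask for the 7-element chain-of-thought confidence list: once at temperature 0, then up to 3 retries at 0.1."""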
    q_len = 7
    llmnet = PromptedLLM(
        model,
        prompt=cot_prompt,
    )
    raw = llmnet.forward(RawInput(text_input=question, image_input=None),temp=0.)
    parsed = _parse_probability_list(raw, length=q_len)
    if parsed is not None:
        return parsed
    else:
        for _ in range(3):
            raw = llmnet.forward(RawInput(text_input=question, image_input=None),temp=0.1)
            parsed = _parse_probability_list(raw, length=q_len)
            if parsed is not None:
                return parsed
        print(f"Failed to parse answer for question: {question}")
        return None          # give up – caller will still log this position
    
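def evaluate_choose(model: APIModel, question: str, chain: str = "") -> list[float] | None:
    """Ask for the 10-element code-vs-CoT probability list: once at temperature 0, then up to 3 retries at 0.1."""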
    q_len = 10
    llmnet = PromptedLLM(
        model,
        prompt=choose_cot_vs_code_prompt,
    )
    raw = llmnet.forward(RawInput(text_input=question, image_input=None),temp=0.)
    parsed = _parse_probability_list(raw, length=q_len)
    if parsed is not None:
        return parsed
    else:
        for _ in range(3):
            raw = llmnet.forward(RawInput(text_input=question, image_input=None),temp=0.1)
            parsed = _parse_probability_list(raw, length=q_len)
            if parsed is not None:
                return parsed
        print(f"Failed to parse answer for question: {question}")
        return None          # give up – caller will still log this position
    
    
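def evaluate_conservative(model: APIModel, question: str, chain: str = "", img=None) -> list[float] | None:
    """Conservative code-vs-CoT probability list (10 elements): once at temperature 0, then up to 6 retries at 0.1."""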
    q_len = 10
    llmnet = PromptedLLM(
        model,
        prompt=choose_conservative_cot_vs_code_prompt,
    )
    raw = llmnet.forward(RawInput(text_input=question, image_input=img),temp=0.)
    parsed = _parse_probability_list(raw, length=q_len)
    if parsed is not None:
        return parsed
    else:
        for _ in range(6):
            raw = llmnet.forward(RawInput(text_input=question, image_input=img),temp=0.1)
            parsed = _parse_probability_list(raw, length=q_len)
            if parsed is not None:
                return parsed
        print(f"Failed to parse answer for question: {question}")
        return None          # give up – caller will still log this position
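

# Illustrative sketch (not called by the pipeline): evaluate_cot, evaluate_choose
# and evaluate_conservative share one pattern: a single attempt at temperature 0,
# then a few retries at temperature 0.1 until the response parses. The helper
# name and its callable parameters are assumptions made for this example.
from typing import Any, Callable, Optional


def _ask_with_retries(
    ask: Callable[[float], Any],
    parse: Callable[[Any], Optional[list]],
    retries: int = 3,
    retry_temp: float = 0.1,
) -> Optional[list]:
    """Call ask(0.0) once, then ask(retry_temp) up to `retries` times; return the first parsed result."""
    for temp in [0.0] + [retry_temp] * retries:
        parsed = parse(ask(temp))
        if parsed is not None:
            return parsed
    return None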

def main(args):
    model = APIModel(args.model)
    np.random.seed(0)
    data = get_dataset(args)

    test_data_ids = list(range(min(200, len(data))))
    shuf = np.random.permutation(test_data_ids)
    test_data = [data[int(i)][0] for i in shuf[: min(200, len(shuf))]]

    # ────────────────────────────────────────────────────────────────────
    # load existing results if present (every method except method 1)
    # ────────────────────────────────────────────────────────────────────
    result_path = f"logs/{args.model}/{args.dataset}/algorithmic_eval_results_{args.method}.json"
    cached: list[Any] = []
    if args.method != "1" and os.path.isfile(result_path):
        with open(result_path) as f:
            cached = json.load(f)

    # Read chains from the log file (unchanged)
    reasoning_chains = []
    if args.method == "1":
        with open(f"logs/{args.model}/{args.dataset}/zs_cot/outputs_gen_1_temp_0.0.txt") as f:
            for line in f:
                out = literal_eval(line)
                reasoning_chains.append(out[2]["output"][0] if isinstance(out[2]["output"], list) else out[2]["output"])
    else:
        reasoning_chains = ['_']*len(test_data)

    # ────────────────────────────────────────────────────────────────────
    # evaluation loop – reuse good cached answers, fix the bad ones
    # ────────────────────────────────────────────────────────────────────
    results: list[Any] = []
    print(f"Evaluating {len(test_data)} questions with method {args.method}...")
    print(len(list(zip(test_data, reasoning_chains))))
    for idx, (question, chain) in enumerate(zip(test_data, reasoning_chains)):
        # pull cached value if it exists and is valid
        # (initialised to None so that method 1, which has no cache, does not raise NameError)
        cached_val = None
        if idx < len(cached):
            if args.method in ("2", "3"):
                cached_val = _parse_answer(cached[idx])
            elif args.method in ("4", "6", "7"):
                cached_val = _parse_probability_list(cached[idx], length=10)
            elif args.method == "5":
                cached_val = _parse_probability_list(cached[idx], length=7)
        if cached_val is not None:
            results.append(cached_val) 
            print(f"Using cached value for {idx}: {cached_val}")
            continue
        else:
            print(f"Evaluating {idx} with question: {question[1]}")

        # otherwise re-evaluate
        if args.method == "1":
            res = evaluate_chain(model, question[1], chain)
        elif args.method == "4":
            res = evaluate_confidence(model, question[1], chain)
        elif args.method == "5":
            res = evaluate_cot(model, question[1], chain)
        elif args.method == "6":
            res = evaluate_choose(model, question[1], chain)
        elif args.method == "7":
            res = evaluate_conservative(model, question[1], chain, question[0])
        else:
            res = evaluate_question(model, question[1], chain, question[0])
        results.append(res)

    # overwrite / create the file (and its parent directory, if missing)
    os.makedirs(os.path.dirname(result_path), exist_ok=True)
    with open(result_path, "w") as f:
        json.dump(results, f)


if __name__ == "__main__":
    # set up argument parser
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", type=str, default="bbeh_geometric_shapes")
    parser.add_argument("--model", type=str, default="gemini-2.0-flash")
    parser.add_argument("--method", default="4")
    parser.add_argument("--eval", action="store_true")
    args = parser.parse_args()


    main(args)