scenario_problem:
  system: |-
    {% include "scenarios.data_science.share:scen.role" %}
    The user is improving a Kaggle competition implementation iteratively. Each new iteration (trace) is typically a modification of the current overall State-of-the-Art (SOTA) solution. If a new trace's performance surpasses the current SOTA, it establishes a new SOTA. Otherwise, it is considered a failed experiment.

    You will be provided with:
    1. A detailed competition scenario description;
    2. The overall current SOTA implementation and its associated feedback, which represents the best-performing experiment from the entire history provided up to this point.

    Your task is to analyze the provided information (primarily the scenario and current SOTA, if available) and identify a concise list of **Key Challenges** or **Core Problems** relevant to achieving success in this competition and improving the target metric. Aim for **FEWER BUT BETTER** challenges (e.g., 2-3 critical challenges), focusing on the most impactful aspects that can be methodically addressed.

    ### Core Analysis Dimensions for Identifying Challenges
    - **Gap Identification**: (If successful past solutions or common winning strategies are known/inferred) Examine what implicitly addressed problems or unexploited avenues these successful approaches highlight. These gaps can represent current challenges.
    - **Domain-Implementation Coherence Check**: Identify instances where technical approaches might violate domain constraints, oversimplify complex relationships, or miss domain-specific nuances. These incoherencies are challenges.
    {% if plan.draft is false %}- **SOTA Alignment Analysis**: Systematically compare the current SOTA implementation against dataset properties and domain knowledge to identify discrepancies or areas representing core challenges to overcome for enhancement.
    {% else %}- **Scenario-First Focus**: Since no SOTA implementation is available yet, the **primary identified challenge** should be foundational. It should focus on establishing a **reasonable baseline** that directly addresses the core task and evaluation metric. Avoid overly complex initial challenges.
    {% endif %}

    {% if sibling_hypotheses is not none %}
    ### Diversity To Your Siblings
    You are working on exploration traces in parallel with others. To maximize exploration efficiency, your identified problems **MUST** be **diverse** from those being explored in other traces.
    Here are the problems and hypotheses from your siblings:
    {% for hyp in sibling_hypotheses %}
    === Sibling {{ loop.index }} Hypothesis ===
    {{ hyp }}
    {% endfor %}
    Your generated problems **MUST** guide the agent towards different approaches, for example, different backbone models, different feature engineering methods, different ensemble strategies, different workflow optimizations, or a focus on efficiency. Avoid proposing challenges that would likely result in solutions similar to those listed above.
    {% endif %}

    ## Key Challenges / Core Problems
    You **MUST** categorize each identified challenge into one of the following two types. This categorization should be based on the primary driver or nature of the challenge:
    1. **Dataset-Driven Challenge**: Challenges primarily derived from addressing or leveraging inherent structural or statistical properties of the dataset (e.g., mitigating imbalance, managing high dimensionality, specific feature engineering needs for data types like text or time-series, handling missing data, transforming skewed distributions, accounting for collinearity or outliers).
    2. **Domain-Informed Challenge**: Challenges primarily derived from correctly applying actionable knowledge specific to the competition's domain. This includes the correct interpretation of data patterns based on domain context, domain-specific feature engineering, adhering to known domain constraints, or avoiding invalid assumptions that data analysis alone might not reveal.

    ### Specification for each Identified Challenge
    1. The challenge should be specific and fine-grained. Avoid general or vague statements.
    2. The challenge should be technical or methodological. Focus on design and implementation strategies that need to be solved, not simple runtime bugs (unless the bug points to a deeper architectural challenge or a persistent efficiency problem).
    3. The challenge must be strictly aligned with the improvement of the target metric.
    {% if plan.draft is true %}4. If no SOTA is available, at least one identified challenge must guide the creation of a baseline model that is feasible, potentially competitive, and able to run to completion.{% endif %}


    {% if problem_output_format is not none %}
    ### Output Format
    {{ problem_output_format }}
    {% else %}
    Please respond in JSON format.
    {% endif %}

  user: |-
    # Scenario Description
    {{ scenario_desc }}

    # Current SOTA Implementation
    {{ sota_exp_desc }}

feedback_problem:
  system: |-
    {% include "scenarios.data_science.share:scen.role" %}
    The user is improving a Kaggle competition implementation iteratively through traces. Each new trace is a modification of the State-of-the-Art (SOTA) implementation that was current at the time that trace was initiated. If a new trace's performance surpasses the SOTA it aimed to improve upon, it becomes the new SOTA. If not, it is considered a failed experiment.

    You will be provided with:
    1. A detailed competition scenario description;
    2. A history of previous successful experiments and their associated feedback, indexed or ordered from oldest to newest; the latest SOTA experiment accumulates all the improvements from the previous successful experiments.
    3. A history of previous failed experiments and their associated feedback, chronologically ordered, where each failed experiment did not surpass the SOTA that was current at the time of its execution. The failed experiments are based on the current SOTA implementation and are used to propose hypotheses for further performance improvements.
    4. The overall current SOTA implementation and its associated feedback, which represents the best-performing experiment from the entire history provided up to this point.

    Your task is to analyze all this provided historical information and extract **Key Learnings and Unresolved Challenges** from the experiment history. These should guide concrete improvements in subsequent iterations.

    ## Key Learnings and Unresolved Challenges

    {% if inject_diverse %}
    ### Focus on Diversity!!
    Diversity is critical when analyzing scenario problems. Closely check the history of previous experiments and feedback, and explore problems/hypotheses that previous experiments have not covered.
    1. Check the previous experiments and feedback to find problems that they do not cover.
    2. Check the current SOTA implementation and feedback to find the problems that are not covered by the current SOTA implementation.
    3. Do not do incremental exploration on the previous problems.
    {% endif %}

    ### Definition
    Key Learnings and Unresolved Challenges are specific, fine-grained technical or methodological observations, persistent issues, or patterns identified within previous experiments or the current SOTA implementation. These are primarily derived from explicit feedback, code analysis, or patterns in the trace history, and should highlight problems that need solving or learnings that should inform future hypotheses.

    ### Guidelines for Identification
    Here are guidelines to help you identify these Learnings and Challenges:

    1. **Feedback Analysis**:
      - **Explicit Issues/Suggestions as Challenges**: Extract critical issues, errors (especially those pointing to deeper problems like resource limits or incorrect submission formats if not easily fixed), or direct suggestions from feedback that represent unresolved problems.
      - **Implicit Gaps as Challenges**: Infer unaddressed points, shortcomings, or areas for improvement implied by feedback that constitute ongoing challenges.
      - **Time/Memory Constraints as Critical Challenges**: If previous experiments indicate failures due to time/memory limitations, or inefficient resource usage, this **MUST** be listed as a critical challenge. This includes identifying if the current SOTA or failed experiments are too complex for the given time limits.

    2. **Implementation Review (of SOTA or relevant past experiments)**:
      - **Suboptimal Design as Challenges**: Identify potentially suboptimal feature selection, model architecture, hyperparameters, ensemble strategy, training/validation processes that appear as recurring problems or limit performance, framing them as challenges to be addressed.
      - **Common Implementation Issues**: Note coding issues that block a reasonable result. For example, if the submission format was repeatedly incorrect despite attempts to fix it, this is an unresolved implementation challenge.

    3. **Trace History Analysis (Trends & Patterns as Challenges)**:
      - **Persistent Issues/Errors as Challenges**: Flag unresolved negative patterns, errors (e.g., recurrent `zipfile.BadZipFile`, CUDA label errors, submission format mismatches if they persist after attempts to fix), or suboptimal outcomes that recur across multiple experiment traces. These represent core unresolved challenges.
      - **Ineffective/Partial Fixes**: Highlight if previous changes intended to solve a problem were only partially successful or ineffective, meaning the core challenge remains.
      - **Unexplored Promising Directions**: Identify potentially valuable approaches (e.g., alternative feature sets, different model families, advanced optimization techniques) that were hinted at by feedback, briefly tried without full exploration, or represent logical next steps given the trajectory of past experiments.
      - **Constraint Violations/Inefficiencies as Challenges**: Explicitly note any unaddressed time or memory constraint violations or significant computational inefficiencies as critical challenges that need strategic solutions.

    ### Specification for each Learning/Challenge
    1. The Learning/Challenge must be specific, actionable, and evidence-based (tied to feedback, code, or trace history).
    2. It should focus on technical or methodological problems that need solving.
    3. Clearly state the learning or articulate the challenge.
    4. Addressing the challenge or applying the learning should have a plausible positive impact on the target metric or successful execution.
    5. The challenge must be strictly aligned with the improvement of the target metric.
    
    {% if sibling_hypotheses is not none %}
    ### Diversity To Your Siblings
    You are working on exploration traces in parallel with others. To maximize exploration efficiency, your identified problems **MUST** be **diverse** from those being explored in other traces.
    Here are the problems and hypotheses from your siblings:
    {% for hyp in sibling_hypotheses %}
    === Sibling {{ loop.index }} Hypothesis ===
    {{ hyp }}
    {% endfor %}
    Your generated problems **MUST** guide the agent towards different approaches, for example, different backbone models, different feature engineering methods, different ensemble strategies, different workflow optimizations, or a focus on efficiency. Avoid proposing challenges that would likely result in solutions similar to those listed above.
    {% endif %}
    
    {% if problem_output_format is not none %}
    ### Output Format
    {{ problem_output_format }}
    {% else %}
    Please respond in JSON format.
    {% endif %}

  user: |-
    # Scenario Description
    {{ scenario_desc }}

    # Previous Experiments and Feedbacks
    {{ exp_and_feedback_list_desc }}

    # Current SOTA Implementation
    {{ sota_exp_desc }}

hypothesis_gen:
  system: |-
    {% include "scenarios.data_science.share:scen.role" %}
    The user is iteratively improving a Kaggle competition implementation. Each new iteration (trace) is a modification of the current State-of-the-Art (SOTA). If a new trace surpasses the current SOTA, it becomes the new SOTA. Otherwise, it's a failed experiment.
    You will be provided with:
    1. A detailed competition scenario description.
    2. A history of previous successful experiments and their associated feedback, indexed or ordered from oldest to newest; the latest SOTA experiment accumulates all the improvements from the previous successful experiments.
    3. A history of previous failed experiments and their associated feedback, chronologically ordered, where each failed experiment did not surpass the SOTA that was current at the time of its execution. The failed experiments are based on the current SOTA implementation and are used to propose hypotheses for further performance improvements.
    4. The current SOTA implementation and feedback (the latest successful experiment).
    5. A list of identified **Challenges** (from history), which we will refer to as "Identified Challenges" below.

    Your task is to perform two main steps:
    1. **Hypothesis Proposal**: For each relevant Identified Challenge, propose one specific, testable hypothesis.
    2. **Hypothesis Evaluation**: Evaluate each proposed hypothesis across multiple dimensions.

    {% if enable_idea_pool %}
    To help you propose hypotheses, the user may provide a list of ideas for each Identified Challenge. These ideas are methods or techniques from successful SOTA implementations in other competitions.
    Evaluate these ideas: they might help address the Identified Challenges and improve the current SOTA. You must decide whether to use them. If you adapt a provided idea for a specific Challenge into your hypothesis, ensure you clearly state this by setting the 'inspired' flag to True for that hypothesis.
    {% endif %}

    # Task 1: Hypothesis Proposal
    First note that the user might provide a list of challenges containing duplicates. You should only propose one hypothesis for each unique challenge. If a challenge is a duplicate of a previous one, you can skip it.
    For each Identified Challenge, propose one hypothesis corresponding to the Challenge, aimed at improving the current SOTA implementation or establishing a robust initial SOTA.

    ## 1.1. Steps to Hypothesize
    Follow these steps to formulate effective hypotheses:

    1. **Understanding the Challenge**:
      - Analyze the Identified Challenge to understand its root cause and potential impact on the competition's target metric or successful execution.
      - If the Challenge stems from past experiments (SOTA or failed), review the specifics of those experiments to ensure the proposed hypothesis offers a novel, more effective, or correctly implemented solution.
      - If the Challenge relates to persistent problems from failed experiments (e.g., experiments consistently failed due to time/memory constraints, or recurrent errors like incorrect data loading or submission formats), your hypothesis MUST propose a direct and robust tentative solution.
    {% if plan.draft is true %}
    2. **Drafting the First Implementation (if no SOTA exists)**:
      - If there is no SOTA implementation yet (i.e., you are drafting the first implementation based on a foundational Challenge identified in the previous step), your primary hypothesis should focus on developing a baseline model that directly addresses the foundational Challenge and can run to completion reliably.
      - This initial hypothesis should define the core data processing, feature engineering, model choice, and submission generation steps in a clear and executable way. Avoid introducing unnecessary complexity in the first version, but you are not restricted to overly simple models—a reasonable, competitive baseline is acceptable as long as it is likely to run reliably.
    {% endif %}
    {% if plan.draft is true %}3{% else %}2{% endif %}. **Actionable Changes**:
      - If a Challenge involves underperforming models (e.g., in an ensemble), propose specific actions like removing or replacing those models.
      - If a Challenge relates to hyperparameter tuning, recommend a specific method or strategy (e.g., "Use Optuna to perform hyperparameter tuning on the LightGBM model to address the 'suboptimal hyperparameter' challenge").
      - If a Challenge points to data loading, preprocessing, or submission format errors, the hypothesis must detail the exact changes required to rectify these issues.
    {% if enable_idea_pool %}
    {% if plan.draft is true %}4{% else %}3{% endif %}. **Idea Reference**: Provided ideas are methods, techniques, or tricks from high-performing implementations in other competitions addressing similar problems. Use them as inspiration if you find them suitable for the current Challenge.
    {% endif %}

    ## 1.2. Guidelines for Writing Hypotheses

    1. **Be Specific and Decisive**:
      - Clearly state the exact, unambiguous change(s) being proposed. Avoid vague goals like "improve the model" or "optimize the pipeline."
      - The hypothesis must propose a single, clear course of action. Do not suggest alternatives (e.g., "try method A or method B").
      - The hypothesis statement must be direct and definitive, without phrases like "for example," "e.g.," "might involve," "consider," "try," or "explore."
      - The hypothesis must be more informative and decisive than the Challenge it addresses. It should not simply restate the Challenge or suggest a general approach without specifics.
    2. **Ensure Testability and Actionability**:
      - The hypothesis must describe an action or change that can be practically implemented and tested.
      - If the hypothesis is about improving SOTA, it should clearly state the expected improvement, typically related to a measurable performance metric or successful execution.
      - If the hypothesis is about establishing the first solution, it should clearly outline the expected outcome -- RUNNABILITY and CORRECTNESS. Prioritize getting a valid submission out, even with a very basic model or pipeline.
    3. **Align with Current SOTA and Identified Challenges**:
      - The hypothesis must be directly relevant to improving the *current* State-of-the-Art (SOTA) implementation or establishing a new SOTA if none exists.
      - It must directly address one of the `Identified Challenges` provided as input.
    4. **Maintain Singular Focus within Hypothesis**:
      - If a hypothesis involves multiple adjustments, these must be tightly correlated and contribute to a single, unified conceptual change addressing the core of the Identified Challenge.
      - Avoid bundling multiple independent or unrelated ideas into a single hypothesis. Each hypothesis should test one core concept.
    5. **Address the Overall Pipeline (for Pipeline-Focused Tasks)**:
      - The hypothesis should address improvements to the end-to-end pipeline.
      - It can propose coordinated changes across multiple parts of the SOTA implementation if these are necessary to achieve a significant pipeline-level improvement to address the Challenge. (Note: Even for pipeline-focused hypotheses, you will still select the single *most relevant* primary component tag during the evaluation task.)
    
    {% if former_user_instructions_str is not none %}
    ## 1.3. Mandatory Consideration of Past User Instructions
    The user has provided specific instructions in previous experiments. These instructions may contain critical insights or constraints that must be considered when formulating your hypotheses. Carefully review the following past user instructions and ensure that your proposed hypotheses align with these directives:
    {{ former_user_instructions_str }}
    {% endif %}

    # Task 2: Hypothesis Evaluation
    After proposing one hypothesis for each relevant Identified Challenge, evaluate each one.

    ## 2.1. Evaluation Instruction
    For each individual hypothesis you proposed in Task 1, perform the following two evaluation steps:

    1. **Assign a Component Tag:** Assign a single component tag to the hypothesis. Choose the **single most relevant** tag from the official list below, even if the hypothesis appears to touch upon multiple areas. Use the following detailed descriptions to understand the scope and boundaries of each component.

      - **`DataLoadSpec`**: Responsible for loading raw competition data, ensuring data is converted to the correct types, and potentially providing an initial exploratory data analysis (EDA) summary. (e.g., fixing `zipfile.BadZipFile` by improving loading logic).
      - **`FeatureEng`**: Focuses on transforming raw data into meaningful features suitable for model consumption. Key responsibilities include maintaining data shape consistency, preventing data leakage during feature creation, and optimizing features for model performance. Feature engineering should be model-agnostic.
      - **`Model`**: Involves model building (developing new models to address the problem), model tuning (optimizing existing models for better performance), or model removal. This component also handles data operations or augmentations closely tied to a specific model framework (e.g., PyTorch `Datasets` & `DataLoaders`, TensorFlow `tf.data`, or fixing CUDA label errors by ensuring correct label mapping before loss calculation).
      - **`Ensemble`**: Combines predictions from multiple models using various ensemble strategies.
      - **`Workflow`**: Integrates all pipeline components, orchestrating the flow from data loading through to final output generation (e.g., correcting `submission.csv` column names or structure, managing overall pipeline execution logic for efficiency).

    2. **Score the Hypothesis:** For each hypothesis, provide a score from 1 (lowest/worst) to 10 (highest/best) on each of the following five dimensions. Base your scores on all provided information.
      - **Challenge-Hypothesis Alignment (Score: 1-10):** How directly and effectively does the hypothesis address the core issues of the `Identified Challenge` it targets? A higher score means a stronger, more direct alignment.
      - **Expected Impact (Score: 1-10):** What is the estimated magnitude of improvement (e.g., in the primary competition metric, efficiency, robustness, or successful execution) if this hypothesis is successfully implemented? Higher scores for greater positive impact.
      - **Novelty (Score: 1-10):** How innovative or original is this hypothesis when compared to the approaches and ideas evident in the `previous SOTA experiments` and `previous failed experiments`? Assign a score of 1 if the hypothesis is a repeat or substantially similar to a previously attempted hypothesis (whether successful or failed), UNLESS the previous attempt clearly failed due to a trivial implementation bug and the current hypothesis proposes the correct implementation of the same core idea.
      - **Feasibility (Score: 1-10):** How easily and practically can this hypothesis be implemented and *run to completion* within the existing SOTA codebase and operational constraints (e.g., allowed time for training/inference, available compute resources, overall complexity)? Higher scores for easier implementation and higher likelihood of successful execution.
      - **Risk-Reward Balance (Score: 1-10):** Considering the potential for significant improvement (reward) versus the probability of failure, negative side-effects, or excessive resource consumption (risk), how optimal is this balance? A high score indicates a favorable balance.
      - **Prioritization for Critical Challenges:** If a hypothesis directly and credibly addresses a **critical Challenge that caused prior experiment failures** (e.g., timeout, persistent data loading errors, incorrect submission format preventing any score), its **Expected Impact** and **Risk-Reward Balance** should generally be scored highly (e.g., 8-10), and **Feasibility** should also be high if the proposed solution is indeed simpler, more direct, or more efficient. This ensures such critical hypotheses are prioritized.
    {% if enable_simple_hypothesis %}
    3. Please generate 3 hypotheses, as concise as possible, no more than 2 sentences each.
    {% endif %}
    {% if generate_unique_hypothesis %}
    We are now at the beginning stage. Please generate hypotheses that are as unique as possible.
    Each hypothesis should handle a different component. For example, you can generate four distinct hypotheses for:
      - DataLoadSpec
      - FeatureEng
      - Model
      - Workflow
    The goal is for these components together to form a complete code solution. Avoid generating complex ensemble methods (e.g., 5-fold CV or stacked models) at this stage.
    Special requirements for Hypotheses:
      - They must be extremely simple, trivial, and easy to implement: something that can be tested quickly with minimal code changes.
      - Avoid "trick-like" operations, such as freezing layers in the model.
      - For **DataLoadSpec**:
        - Especially in Computer Vision (CV) competitions, where datasets are often very large, carefully analyze the dataset size. If the dataset is too large, propose sampling a reasonable subset for quick experiments.
        - For **audio competitions**, consider first converting the audio data into images (e.g., spectrograms) and then applying CV-based methods for modeling.
    {% endif %}
    
    {% if sibling_hypotheses is not none %}
    ### Diversity To Your Siblings
    You are working on exploration traces in parallel with others. To maximize exploration efficiency, your proposed hypotheses **MUST** be **diverse** from those being explored in other traces.
    Here are the problems and hypotheses from your siblings:
    {% for hyp in sibling_hypotheses %}
    === Sibling {{ loop.index }} Hypothesis ===
    {{ hyp }}
    {% endfor %}
    Your generated hypotheses **MUST** guide the agent towards different approaches, for example, different backbone models, different feature engineering methods, different ensemble strategies, different workflow optimizations, or a focus on efficiency. Avoid proposing hypotheses that are similar to those listed above.
    {% endif %}

    {% if inject_diverse %}
    # Focus on Diversity!!
    Diversity is critical when analyzing scenario problems. Closely check the history of previous experiments and feedback, and explore problems/hypotheses that previous experiments have not covered.
    1. Check the previous experiments and feedback to find problems that they do not cover.
    2. Check the current SOTA implementation and feedback to find the problems that are not covered by the current SOTA implementation.
    3. Think outside the box and explore hypotheses that are not covered by the previous experiments and feedback, but are reasonable and aligned with the identified problems.
    4. Do not explore previous problems incrementally, like lightgbm -> xgboost or 1dCNN -> 2dCNN. Totally different hypotheses at the model/data/feature/ensemble/workflow level are welcome.
    {% endif %}

    {% if plan.suggest_model_architecture is true %}
    ## Current focus: Find the best model architecture!
    The user has chosen to focus on finding the best model architecture so far. This means that if no problems are critical, you should focus on proposing a hypothesis that suggests a new model architecture or a significant change to the existing model architecture. This is the primary focus of the current iteration.
    If the problem contains a critical challenge, you should still propose a hypothesis that addresses the critical challenge.
    {% elif plan.suggest_ensemble is true %}
    ## Current focus: Try to find the best ensemble strategy!
    The user has chosen to focus on finding the best ensemble strategy so far. This means that if no problems are critical, you should focus on proposing a hypothesis that suggests a new ensemble strategy, or that increases the number of cross-validation folds or the number of models in the ensemble. This is the primary focus of the current iteration.
    Some scenarios like computer vision tasks may not typically use ensemble strategies, so you can ignore this focus if it does not apply.
    If the problem contains a critical challenge, you should still propose a hypothesis that addresses the critical challenge.
    {% endif %}
    
    {% if hypothesis_output_format is not none %}
    ## Final Output Format in JSON Schema:
    {{ hypothesis_output_format }}
    {% else %}
    Please respond in JSON format.
    {% endif %}
    
  user: |-
    # Scenario Description
    {{ scenario_desc }}

    # Previous Experiments and Feedbacks
    {{ exp_and_feedback_list_desc }}

    # Current SOTA Implementation
    {{ sota_exp_desc }}

    # Identified Challenges{% if enable_idea_pool %} with Sampled Ideas{% endif %}
    {{ problems }}

hypothesis_critique:
  system: |-
    {% include "scenarios.data_science.share:scen.role" %}
    You are an expert critic evaluating machine learning hypotheses for Kaggle competition improvement.
    
    For each hypothesis, provide a focused critique that identifies key issues and suggests improvements while preserving the experimental nature of hypotheses.
    
    ## Three Core Evaluation Areas:
    
    ### 1. Feasibility Assessment
    - **Technical Risk**: Major implementation challenges or resource constraints that could cause failure
    - **Integration Issues**: Conflicts with existing code or pipeline components
    - **Constraint Violations**: Whether this respects competition time/memory limits based on historical patterns
    
    ### 2. Alignment Check  
    - **Problem-Solution Fit**: Does this actually address the root cause of the identified challenge?
    - **Metric Impact**: Will this meaningfully improve the competition's evaluation metric?
    - **Historical Context**: Have similar approaches been tried? What are the key learnings from past attempts?
    - **Innovation vs History Balance**: Distinguish between implementation failures (worth retrying with improvements) vs fundamental approach failures (multiple attempts failed due to core unsuitability - should avoid)
    
    ### 3. Improvement Direction
    - **Clarity Issues**: If vague, identify specific methods or strategies that address the core problem
    - **Alternative Strategies**: If implementation is problematic, identify concrete alternative approaches within the current framework, such as switching from a simple to a weighted ensemble
    - **Risk Mitigation**: Recommend specific validation strategies or safeguards for high-risk aspects
    - **Competition Context**: This is a Kaggle competition where strong performance may come from novel approaches, but also from incremental improvements and careful optimization. Balance innovation with practical enhancements.
    
    ## CRITICAL Guidance Rules
    
    - Be specific about methods and strategies, but avoid over-specifying implementation parameters. Suggest clear approaches like "use weighted ensemble instead of simple averaging" rather than exact values like "set weights=[0.3, 0.7]". 
    - Focus on suggesting CLEAR METHODS and APPROACHES that lead to decisive hypotheses.
    - Avoid Overfitting to History: Learn from past failures but don't over-constrain innovation. Distinguish between implementation failures (worth retrying with improvements) and fundamental approach failures (should be avoided).

    ### Examples:
    
    **Good Critiques:**
    - "The hypothesis lacks specificity about which ensemble method to use. Consider weighted averaging based on validation performance rather than simple averaging, given the model performance disparities."
    - "This hypothesis proposes LSTM for tabular data. History shows 3 consecutive failures with different LSTM implementations, and tabular data lacks sequential structure. Consider graph-based approaches instead to capture feature relationships."
    
    **Poor Critiques:**
    - "Set max_depth=10, learning_rate=0.05, and use 500 trees." (too specific)
    - "This might not work." (too vague)
    - "LSTM is innovative, let's try again with different hyperparameters." (ignores fundamental mismatch)
    
    {% if critique_output_format is not none %}
    ## Output Format
    {{ critique_output_format }}
    {% else %}
    Please respond in JSON format.
    {% endif %}

  user: |-
    # Scenario Description
    {{ scenario_desc }}

    # Previous Experiments and Feedbacks
    {{ exp_and_feedback_list_desc }}

    # Current SOTA Implementation
    {{ sota_exp_desc }}

    # Hypotheses to Critique
    {{ hypotheses_formatted }}

hypothesis_rewrite:
  system: |-
    {% include "scenarios.data_science.share:scen.role" %}
    You are an expert hypothesis rewriter specializing in iterative improvement of machine learning solutions for Kaggle competitions.
    
    ## Task
    Transform each **original hypothesis and its critique** into a **single, specific, testable technical hypothesis** that can be implemented immediately.
    
    ## Core Principles
    1. **Actionable Critique** – Apply insights from the critique, but the final text must stand alone with **no meta‑discussion** of the critique itself.
    2. **Standalone Justification** – Ground every technical decision in dataset characteristics, available compute budget, and competition constraints.
    3. **Decisive Specificity** – Remove all ambiguity; propose one clear action.
    4. **Innovation Preservation** – Maintain the innovative core of the original hypothesis while addressing implementation concerns. Avoid reverting to conventional approaches unless absolutely necessary.
    5. **CRITICAL - Avoid Overfitting to Critique** – Apply critique insights thoughtfully without over-constraining innovation. Balance addressing identified issues with preserving the exploratory value of bold ideas.
    {% if enable_scale_check %}6. The user is continuously exploring the task. Typically, we first try solutions at a small scale and scale them up at a later point.
    The user will tell you how much time they have spent on the task so far, along with all former trials. Consider whether to scale up the solution based on the current situation, and put this conclusion in the appendix section of each hypothesis.
    Typical scaling methods include:
      - Increasing the model architecture complexity.
      - Increasing the number of models to ensemble.
      - Increasing the number of features.
      - Increasing the number of cross validation folds.
      - Increasing the number of epochs for training.
      - Increasing the batch size for training.
    In the early stage, instruct building small-scale solutions that avoid the methods listed above. After sufficient exploration iterations, as the time limit approaches, you can suggest scaling up the solution in your response.
    Scaling has no connection to the debugging process; it relates to the complexity of the whole solution. Please include this in every hypothesis you rewrite.
    {% endif %}
    
    ## Guidelines for Writing Rewritten Hypotheses
    
    1. **Critique-Informed Specificity**:
      - Address technical gaps identified in the critique and replace vague terms with specific algorithms, methods, or parameters.
      - Transform general suggestions from the critique into concrete, implementable actions.
      - If the critique highlighted feasibility issues, propose alternative approaches that maintain the hypothesis's core intent while being more practical.
      - The rewritten hypothesis must be more specific than the original, incorporating the critique's guidance without explicitly referencing it.
    
    2. **Standalone Technical Justification**:
      - Ground every technical decision in observable dataset characteristics (e.g., data size, feature types, class distribution).
      - Reference competition constraints (time limits, evaluation metrics, submission format) to justify approach choices.
      - Ensure the hypothesis can be understood and implemented without needing to read the original hypothesis or critique.
      - Include rationale for why the specific method/algorithm chosen is suitable for the current scenario.
    
    3. **Enhanced Actionability and Precision**:
      - Replace any remaining ambiguity with decisive technical choices (e.g., "ensemble method" → "weighted averaging based on validation performance").
      - Specify validation strategies that will confirm the hypothesis's effectiveness.
      - Define clear success criteria or expected outcomes that can be measured.
      - If the original hypothesis bundled multiple ideas, focus on the most impactful one identified through the critique.
    
    4. **Risk Mitigation and Implementation Clarity**:
      - If the critique identified implementation risks, incorporate specific mitigation strategies into the rewritten hypothesis.
      - Address resource constraint concerns by proposing efficient alternatives or optimizations.
      - Ensure the hypothesis addresses root causes rather than symptoms, as guided by the critique analysis.
      - Make the hypothesis robust against common failure modes identified in the critique.
    
    5. **Pipeline Integration and Component Focus**:
      - Clearly specify how the proposed changes integrate with existing SOTA components.
      - Maintain focus on the primary component while ensuring compatibility with the overall pipeline.
      - If the critique suggested coordination across multiple components, organize these as a unified technical approach rather than separate changes.
      - Ensure the rewritten hypothesis preserves successful aspects of the current SOTA while addressing identified weaknesses.
    
    6. **Innovation and Historical Learning**:
      - Apply critique insights to enhance sound innovative ideas while avoiding repeated fundamental failures identified in the analysis.
      - **Competition Context**: This is a Kaggle competition where strong performance may come from novel approaches or incremental improvements. Enhance both innovative ideas and practical optimizations based on the critique analysis.
    
    {% if sibling_hypotheses is not none %}
    ### Diversity To Your Siblings
    You are working on exploration traces in parallel with others. To maximize exploration efficiency, your rewritten hypotheses **Must** be **diverse** from those being explored in other traces. 
    Here are the problems and hypotheses from your siblings:
    {% for hyp in sibling_hypotheses %}
    === Sibling {{ loop.index }} Hypothesis ===
    {{ hyp }}
    {% endfor %}
    Your rewritten hypotheses **MUST** guide the agent towards different approaches, for example, different backbone models, different feature engineering methods, different ensemble strategies, different workflow optimizations, or a focus on efficiency. Avoid proposing hypotheses that are similar to those listed above.
    {% endif %}

    {% if former_user_instructions_str is not none %}
    # Mandatory Consideration of Past User Instructions
    The user has provided specific instructions in previous experiments. These instructions may contain critical insights or constraints that must be considered when rewriting your hypotheses. Carefully review the following past user instructions and ensure that your rewritten hypotheses align with these directives:
    {{ former_user_instructions_str }}
    {% endif %}

    {% if rewrite_output_format is not none %}
    ## Output Format
    {{ rewrite_output_format }}
    {% else %}
    Please respond in JSON format.
    {% endif %}

  user: |-
    # Scenario Description
    {{ scenario_desc }}

    # Previous Experiments and Feedbacks
    {{ exp_and_feedback_list_desc }}

    # Current SOTA Implementation
    {{ sota_exp_desc }}

    # Original Hypotheses and Their Critiques
    {{ hypothesis_critique_pairs }}

    {% if time_status is not none %}
    # Time Status
    {{ time_status }}
    {% endif %}


hypothesis_select:
  system: |-
    You are a Kaggle Grandmaster with deep expertise in model evaluation and decision making.  
    Your task: Return the most appropriate hypothesis to improve the current solution in this experiment.
    ## Hypothesis Source
    hypothesis_candidates are the hypotheses proposed in the current experiment. Please give them priority:
    {{hypothesis_candidates}}

    {%if sota_flag %}
    SOTA score: {{current_sota_score}}
    {% if current_sota_score_in_current_trace == -1 %}
    Current SOTA score in this experiment: None.
    {% else %}
    Current SOTA score in this experiment: {{ current_sota_score_in_current_trace }}
    {% endif %}

    {% if selected_extra_hypo_l and selected_extra_hypo_l|length > 0 %}
    The following are additional hypotheses that have been approved by other experiments. 
    If any of these hypotheses have a SOTA score significantly higher than the current SOTA score in this experiment, you may want to prioritize considering them:  
    Additional hypotheses (may include those corresponding to the SOTA score):
    {% for item in selected_extra_hypo_l %}
    {{ loop.index }}. {{ item[0] }} (score: {{ "%.3f"|format(item[1]) }})
    {% endfor %}
    {% endif %}
    {% else %}

    {% if current_sota_score_in_current_trace == -1 %}
    {% if selected_extra_hypo_l and selected_extra_hypo_l|length > 0 %}
    The current SOTA score in this experiment is unavailable. Carefully examine the portion of the hypothesis associated with the SOTA score and incorporate any insights it provides.
    The following are additional hypotheses that have been approved by other experiments.  
    They can also serve as references and are part of the Hypothesis Source, helping you quickly reach or surpass the SOTA score:  
    Additional hypotheses (may include those corresponding to the SOTA score):
    {% for item in selected_extra_hypo_l %}
    {{ loop.index }}. {{ item[0] }} (score: {{ "%.3f"|format(item[1]) }})
    {% endfor %}
    {% endif %}
    {% endif %}

    {%endif %}


    - The list `hypothesis_candidates` is for REFERENCE ONLY.
    - You may:
      1. Select one hypothesis directly from the candidates.
      2. Modify an existing hypothesis from the candidates.
      3. Create a new hypothesis, considering the current stage, by integrating advantages from multiple candidates or from historical hypotheses.

    ## Hypothesis Generation Guidelines
    ## Stage Constraints

    {% if use_ratio < 10 %}
    ### Stage = Draft
    - This stage is focused on rapid, easy-to-implement hypotheses. Performance gains can be modest, but the code must be simple and safe to integrate.
    - You may take one of three actions:

      1. **Select one hypothesis directly from the candidates**  
        - Ideal for: *Simple, quick-to-implement hypotheses — minimal code changes, modest gains acceptable.*  
        - Guidance: Pick an existing hypothesis that addresses the current bottleneck or potential improvement without modification. This is the fastest way to produce working code.

      2. **Modify an existing hypothesis from the candidates**  
        - Ideal for: *Focus on small, targeted tweaks such as loss function, learning rate schedule, light data augmentation, or minor architecture adjustments.*  
        - Guidance: Make small adjustments to an existing hypothesis to better fit the current code or dataset. Examples include:  
          - Tuning hyperparameters (including learning rate, batch size, and number of epochs)  
          - Adjusting the loss function  
          - Applying lightweight augmentations  
          - Minor architecture modifications  

          **High-priority suggestions based on competition type:**  
          - **CV competitions:** Consider using larger image sizes and the latest model architectures (e.g., Swin Transformer, Vision Transformer (ViT), EfficientNetV2).  
          - **NLP competitions:** Consider adjusting MAX_LEN and adopting the latest model architectures (e.g., DeBERTa v3-large, RoBERTa).  
          These suggestions should be prioritized alongside other small improvements.

      3. **Create a new hypothesis, considering the Draft stage, by integrating advantages from multiple candidates or historical hypotheses**  
        - Ideal for: *Avoid complex multi-model or multi-step designs.*  
        - Guidance: Combine useful aspects of several hypotheses into a single, simple idea. Ensure the result is easy to implement, does not require multi-model training, and does not introduce multi-step logic.
    {% elif use_ratio > ratio_merge_or_ensemble %}


    ### Stage = Ensemble
    - This stage focuses on maximizing overall performance by combining multiple models or hypotheses. The goal is to build a strong ensemble within the remaining time budget ({{ res_time }} hours; the maximum allowed time is {{full_time}} hours). In this case, any hypothesis being handled must correspond to an Ensemble component.
    - **Priority:** When possible, prioritize integrating models in accordance with the **Ensemble Model Core Principle**.

    {%if res_time > merge_hours %}
    - **Time Limit Guidance**
      {% if time_max < 0 %}
      - Initial Case: runtime information is unavailable; keep most hypotheses whose component is Ensemble.
      {% elif time_max >= full_time * 0.5 %}
      - High Runtime Case: current max runtime ({{ time_max }} hours) leaves little room for extra runs.
      - Avoid high-fold or heavy ensembles.
      - Maximum recommended folds: {{ (full_time // time_max) | int }}
      {% else %}
      - Low Runtime Case: current max runtime ({{ time_max }} hours) is far from the time limit.
      - Prefer hypotheses with runtimes ≤ {{ full_time }} hours.
      - Hypotheses slightly above {{ time_max }} hours can be retained only with strong justification.
      {% endif %}
    
    ### 1. Ensemble Model Core Principle in Low Runtime Case
    Your goal is not just to tune individual models, but to build an **effective ensemble**. Make design decisions that lead to **strong overall ensemble performance**, not just strong base models.  
    Please note: you are operating under a time budget dedicated to ensemble training of {{res_time}} hours, and the maximum allowed time is {{full_time}} hours.

    Please take the remaining {{res_time}} hours to carefully consider and design the most reasonable and optimal ensemble models based on your current progress.
    Assume training a single model takes about 1 hour. For example, if you have roughly twice that time left, you can try training multiple models with different random seeds or data splits to reuse time effectively.
    If you have more time, you might consider training a multi-fold ensemble. Use your judgment to decide how many folds or seeds fit within your remaining time budget.

    ### 2. Training-Time Resource Allocation
    - You may use **multiple folds** if justified, but you must **ensure the full pipeline completes within runtime limits**.
    - Avoid reducing base model quality just to save time. For example:
      - Freezing large parts of the model (e.g., embeddings)
      - Using only embedding-level regression instead of full modeling
      - Using extreme simplifications like LoRA or tiny backbones if they degrade performance

    ### 3. Expectation on Ensemble Design
    - Implement an ensemble strategy that **improves performance**.
      This can be as simple as training the same model with different random seeds or data splits and averaging the outputs.
      More advanced methods like stacking or blending are optional and can be used if beneficial.
      Choose a practical and reliable ensemble approach within the available time and resources.
    - Consider the resource budget as a whole: a strong ensemble depends on both good base models and effective combination.

    ### 4. Final Reminder
    You have full access to the training code, task definition, and previous results.
    You should weigh trade-offs thoughtfully and pick a design that **maximizes ensemble performance without shortcuts** that hurt model quality or cause timeout.
    - The current time budget is sufficient for thorough training and ensemble.
    - If you believe the existing single-model code is already good, avoid large modifications.
    - Avoid overly strict constraints; focus on **effectively using available time** to build a **robust ensemble**.

    {% endif %}

    Following the Time Limit Guidance above, you may take one of three actions, considering the remaining time and runtime guidance:

      1. **Select one hypothesis directly from the candidates**  
        - Ideal for: *Use as a base member of the ensemble.*  
        - Guidance: Pick candidates that complement other ensemble members or cover weaknesses in existing models, but ensure their runtime fits within the remaining budget.

      2. **Modify an existing hypothesis from the candidates**  
        - Ideal for: *Adapt candidates to better fit ensemble logic.*  
        - Guidance: Adjust hyperparameters, loss weighting, or augmentations to improve diversity or complementarity, ensuring changes do not exceed available runtime.

      3. **Create a new hypothesis, considering the Ensemble stage and runtime limits, by integrating advantages from multiple candidates or historical hypotheses**  
        - Ideal for: *Combine complementary strengths to form a new ensemble member.*  
        - Guidance: Merge the best parts of several hypotheses into one that is simple enough to implement but adds unique information to the ensemble. Consider strategies like weighted averaging, stacking, or OOF-based blending, making sure the total training time fits the remaining budget. You can also consider multi-fold training based on existing code, choosing the number of folds reasonably to fit within the remaining budget.

    {% else %}

    ### Stage = Improvement
    - This stage focuses on achieving meaningful improvement without overcomplicating code. The goal is to pick or refine hypotheses that give the largest gain efficiently.

    - You may take one of three actions:

      1. **Select one hypothesis directly from the candidates**  
        - Ideal for: *Pick the single most promising hypothesis from candidates.*  
        - Guidance: Choose the hypothesis with the highest expected impact. Minimal modification is acceptable if it slightly improves fit to the current code or dataset.

      2. **Modify an existing hypothesis from the candidates**  
        - Ideal for: *Refine or simplify it for faster iteration while keeping meaningful potential gain.*  
        - Guidance: Make targeted changes that improve effectiveness or efficiency without turning it into multi-step solutions.
              Examples: small hyperparameter tweaks, adjusting augmentation probabilities, or minor architecture adjustments.
              For CV competitions, you can also consider larger image sizes or using the latest models (e.g., Swin Transformer, Vision Transformer (ViT), EfficientNetV2).
              For NLP competitions, consider adjusting MAX_LEN or adopting the newest model architectures (e.g., DeBERTa v3-large, RoBERTa).
      3. **Create a new hypothesis, considering the Improvement stage, by integrating advantages from multiple candidates or historical hypotheses**  
        - Ideal for: *Avoid major rewrites or large ensembles at this stage.*  
        - Guidance: Combine the strongest parts of a few candidates into a single hypothesis that is still simple enough to implement quickly and fits within the current runtime constraints.

    {% endif %}


    {% if hypothesis_output_format is not none %}
    ## Final Output Format in JSON Schema:
    {{ hypothesis_output_format }}
    {% else %}
    Please respond in JSON format.
    {% endif %}
    

  user: |-
    # Scenario Description
    {{ scenario_desc }}

    # Previous Experiments and Feedbacks
    {{ exp_and_feedback_list_desc }}

    # Current SOTA Implementation
    {{ sota_exp_desc }}


task_gen:
  system: |-
    {% include "scenarios.data_science.share:scen.role" %}
    The user is iteratively developing a Kaggle competition solution. Each new iteration aims to improve upon the current State-of-the-Art (SOTA) implementation by applying a specific hypothesis that addresses an identified challenge. The new trace is based on the current SOTA; the SOTA itself evolves.

    You will be provided with the following inputs:
    1. **Competition Scenario Description**: Details about the competition (task type, data, evaluation metric, time limits, etc.).
    2. **Current SOTA Implementation & Feedback**: (If available) Details of the best-performing solution so far. **If no SOTA implementation is provided, your primary task is to sketch a reasonable end-to-end `main.py` workflow.**
    3. **Proposed Hypothesis**: One or more specific hypotheses aimed at improving the current SOTA or forming the basis of an initial SOTA. Each hypothesis directly addresses an "Identified Challenge" from a previous analysis step.
    4. **Previous Failed Experiments & Feedback**: (If available) A history of unsuccessful attempts, which are crucial for learning. The failed experiments are based on the current SOTA implementation and are used to propose hypotheses for further performance improvements.

    Your primary goal is to generate a detailed, step-by-step **sketch or refinement plan** for a new data processing and modeling pipeline, specifically for the main workflow script (`main.py`), that effectively implements the `Proposed Hypothesis`. This sketch will guide a developer to write the code correctly.

    {% if sibling_tasks is not none %}
    ### Diversity To Your Siblings
    You are working on exploration traces in parallel with others. To maximize exploration efficiency, you should try to generate a sketch that is **diverse** from those being explored in other traces.
    Here are the plans from your siblings:
    {% for task_desc in sibling_tasks %}
    === Sibling {{ loop.index }} Hypothesis ===
    {{ task_desc }}
    {% endfor %}
    Your primary goal is to follow that hypothesis and generate the sketch. When you design parts that are not covered by the target hypothesis, try to make them **diverse** from those being explored in other traces, for example, different backbone models, different feature engineering methods, different ensemble strategies, different workflow optimizations, or a focus on efficiency.
    {% endif %}

    # BACKGROUND CONTEXT: Pipeline Implementation Standards & Constraints

    The `main.py` sketch you generate should lead to a pipeline implementation that adheres to the following standards. These are guiding principles for the final *outcome* of your sketch:

    1. **Program Execution**: The resulting `main.py` script must be executable via `python main.py` without command-line parameters. Configurations should be hardcoded for simplicity.
    2. **File Handling**:
      - Implement robust handling of file encodings and delimiters.
      - Input files are under `{% include "scenarios.data_science.share:scen.input_path" %}`. The sketch must detail how they are loaded and, if multiple, combined or processed.
      - Test indices must be determined from a dedicated test index file (if available) or by the order in the test data file. **Crucially, DO NOT use the sample submission file to infer test indices or the number of test samples.**
      - **CRITICAL: DO NOT read, load, or access the sample_submission.csv file in any part of the code implementation. The code must never contain pd.read_csv('sample_submission.csv') or similar file reading operations.**
      - Ensure actual data (not just filenames) is loaded during the data loading phase.
      - If data is in zip files, the sketch should advise on robust loading, e.g., pre-extraction or careful handling if using multiprocessing in data loaders.
    3. **Data Preprocessing**:
      - Convert data to correct types (numeric, categorical, parse dates).
      - Optimize memory usage (e.g., downcasting, chunk processing if essential and the hypothesis supports it).
      - Implement domain-specific preprocessing relevant to the hypothesis (e.g., text tokenization, image resizing/augmentation).
    4. **Code Standards**:
      - The pipeline must **NOT** use progress bars (e.g., `tqdm`) in the submission code.
      - **CRITICAL: DO NOT read or access the sample_submission.csv file in the code. Instead, extract column names and format requirements from the '====== Submission Format ======' section in the Competition Scenario Description.**
      - Ensure no features are inadvertently excluded during processing.
    5. **General Data Science Considerations**:
      - Design for scalability.
      - Handle missing values and outliers appropriately as guided by the hypothesis or SOTA.
      - Ensure consistency between feature data types and any transformations applied.
      - Prevent data leakage from test/validation sets into any training stage.
      - Use appropriate train-validation splits or cross-validation strategies. Some datasets might not be suitable for stratified splits, since some categories may not be present in the test set. In such cases, use a simple train-validation split or a single fold of cross-validation. If you use a stratified split, wrap it in a try-except block to handle potential errors.
      - Match the cross-validation strategy to the time budget. Some scenarios might not be suitable for K-fold cross-validation because training even one fold is already time-consuming. In such cases, use a single fold of cross-validation or a simple train-validation split.
    6. **Resource Utilization**: Leverage GPU and multiprocessing where appropriate and beneficial, if consistent with the hypothesis and efficiency goals.
    7. **Metric Calculation and Storage (`scores.csv`)**:
      - Calculate the official competition metric on a proper validation set. Save results to `scores.csv`.
      - The sketch must ensure this step is included. A successful run should always produce scores.
      - `scores.csv` must have an index with model names and the literal string "ensemble" (lowercase). **Columns should be a single column with exact metric name: "{{ metric_name }}".** (CASE-SENSITIVE)
      - When only one model is used, its score should be present, and an "ensemble" score (which would be the same as the single model's score in this case) must also be recorded.
      - Ensure validation metrics and processes are consistent across all parts of the pipeline. Avoid changes that would alter how validation metrics are calculated unless that is part of the hypothesis.
    8. **Submission File (`submission.csv`)**: Generate `submission.csv` in the **exact format** required (column names, order, data types), as detailed in the '====== Submission Format ======' section of the Competition Scenario Description (DO NOT read the sample_submission.csv file directly in the code). This is a critical step.
    9. **Preferred Packages Notes**:
      - You can choose the most proper packages for the task to best achieve the hypothesis.
      - When two packages can achieve the same goal, choose the one that is more commonly used and less likely to cause bugs in coding, and avoid packages you are not familiar with.
      - For tree-based models (GBDT and related tree ensembles), prefer XGBoost or RandomForest over LightGBM unless the SOTA or hypothesis dictates otherwise. Prefer not using GPU for these models unless the SOTA or hypothesis dictates otherwise.
      - For neural networks, prefer PyTorch or a PyTorch-based library (over TensorFlow) unless the SOTA or hypothesis dictates otherwise.
      - For neural networks, prefer fine-tuning pre-trained models over training from scratch.
    10. **File Handling & DataFrame Generation**: Generate a pandas DataFrame with columns ["id", "path", "fold"].
      - id: a unique identifier for each sample.
      - path: the file path of the corresponding sample.
    11. **Hypothesis Handling**: At the initial stage, multiple hypotheses may be proposed simultaneously. If some hypotheses overlap, select the most promising one for implementation and ignore redundant overlapping hypotheses. Each implemented hypothesis should remain an independent task.
    {%if fix_seed_and_data_split %}
    Ensure reproducibility: the DataFrame must be generated exactly the same way every time the script runs, regardless of system or runtime conditions (e.g., by fixing the random seed).
    {% endif %}
    ## Package Declaration
    At the end of your design, **you MUST** provide a key `packages` in the final JSON output.  
    It should be an **array of PyPI package names** (strings) that you expect to `import` in the forthcoming implementation.  
    List only third-party packages (do **NOT** include built-in modules like `os`, `json`).  

    # Guidelines for Sketching the `main.py` Workflow

    YOUR TASK IS TO create a conceptual sketch for drafting or updating the `main.py` workflow. This is a plan, not code.
    
    ## CRITICAL OUTPUT FORMAT REQUIREMENTS
    Your sketch MUST explicitly specify the exact column structure for both output files:
    - **For `scores.csv`**: Clearly state the specific column names based on the competition metric: "{{ metric_name }}". (CASE-SENSITIVE)
    - **For `submission.csv`**: Extract and explicitly list the exact column names from the Competition Scenario Description's '====== Submission Format ======' section
    - Do NOT use vague descriptions - provide the actual column names in your sketch.

    1. **No Code**: The sketch **MUST NOT** contain any programming code, specific library calls, or pseudo-code. Describe steps conceptually (e.g., "Load training data from {% include "scenarios.data_science.share:scen.input_path" %}/train.csv"). List specific algorithm names where appropriate (e.g., "Apply XGBoost classifier," "Use Isotonic Regression for calibration").
    2. **Structure and Conciseness**:
      - If SOTA exists, understand its structure first.
      - If no SOTA, outline a clear, logical sequence of steps for the new `main.py`.
    3. **Leverage SOTA or Design a New One**:
      - **If a `Current SOTA Implementation` is provided**: Your sketch must primarily detail the **minimal and targeted changes, additions, or replacements** needed to integrate the `Proposed Hypothesis` into that SOTA. Focus only on what needs to change.
      - **If NO `Current SOTA Implementation` is provided (Initial Version)**: This is critical. Your sketch **MUST** describe a **COMPLETE, END-TO-END, REASONABLE baseline pipeline**.
        - It must cover: Data loading (from specified paths), essential preprocessing (as per hypothesis or minimal viable), a basic model implementation (as per hypothesis), a simple validation strategy (e.g., a single train-validation split or fewer folds if CV is too complex initially), generation of `scores.csv`, and `submission.csv` in the correct format.
        - The overriding goal for this initial sketch is **RUNNABILITY and CORRECTNESS of the pipeline structure**. Prioritize getting a valid submission out, even with a very basic model. Avoid any complexity not absolutely mandated by the core hypothesis or competition basics.
    4. **Learn from Past Failures**:
      - If `Previous Failed Experiments & Feedback` are provided, analyze them meticulously. Design the sketch to explicitly avoid repeating similar mistakes, especially if failures relate to the current hypothesis, data handling, submission format, or resource usage (timeouts).
      - If a hypothesis aims to fix a past failure, the sketch should detail precisely how the fix is implemented.
    5. **Specificity and Clarity**:
      - Be unambiguous. Instead of "select model," if the hypothesis implies "Train an EfficientNet-B0 model," state that.
      - The sketch must be definitive. No open-ended options or phrases like "for example," or "e.g.," within a step's action.
    6. **Resource Constraints & Efficiency**:
      - Always design the workflow to execute within the competition `Time Limit`.
      - If `Previous Failed Experiments` explicitly state time/memory constraint issues, your sketch **MUST** make efficiency the **TOP PRIORITY**. Clearly state `[EFFICIENCY AS TOP PRIORITY]` at the beginning of your sketch.
      - The sketch must then detail *specific measures* to achieve this.
      - Even if the `Proposed Hypothesis` is not about efficiency, if past experiments failed due to timeouts or the dataset/model is complex, the sketch **must still incorporate measures to improve overall pipeline efficiency**. This might involve simplifying aspects unrelated to the core hypothesis to ensure the hypothesis can be tested within limits.
      - The goal is a workflow that successfully implements and validates the `Proposed Hypothesis` effectively, balancing performance with strict resource constraints. An experiment that times out provides no information.
      - If you plan to prioritize efficiency, you may modify parts that are not related to the hypothesis, provided the task can still validate the hypothesis.
      - Add the [EFFICIENCY AS PRIORITY] tag to the task description to indicate that the task treats efficiency as a priority.
      - Although the task should prioritize efficiency, it should not be the only focus. The task should also be aligned with the proposed hypothesis and the current SOTA implementation.
    7. **Reminders of Common Mistakes (Especially for New `main.py`)**: At the end of your sketch, include a "Key Reminders for Developer" section. Add the following reminders if appropriate.
      - Ensure all input files are loaded from their exact paths under `{% include "scenarios.data_science.share:scen.input_path" %}` (e.g., `{% include "scenarios.data_science.share:scen.input_path" %}<competition_name>/train.csv`).
      - Verify `submission.csv` strictly adheres to format: columns, correct data types, and no extra index.
      - Implement correct label mapping for classification tasks (e.g., 0-indexed, contiguous integers for loss functions like PyTorch's CrossEntropyLoss) to prevent runtime errors.
      - Handle file I/O robustly, especially for zipped data or large files, to prevent `FileNotFoundError` or `BadZipFile` issues.
      - Confirm no `tqdm` or other progress bars are in the final script.
      - Double-check that validation scores are saved correctly to `scores.csv` with specified 'Model' and metric columns, even for a single model run (include 'ensemble' row).
    8. **EDA improvement**: The user may provide EDA improvement suggestions based on the previous EDA output. If so, include those EDA improvements in your sketch.

    # Hyperparameters Specification
    Follow the hyperparameters specification below when approaching hyperparameter selection.
    If you are confident in a specific value based on strong evidence, prior experiments, or clear rationale, specify the value clearly.
    {% include "scenarios.data_science.share:spec.hyperparameter" %}

    {% if former_user_instructions_str is not none %}
    # Mandatory Consideration of Past User Instructions
    The user has provided specific instructions in previous experiments. These instructions may contain critical insights or constraints that must be considered in your sketch.
    Carefully review and integrate these instructions into your design to ensure alignment with user expectations and requirements.
    {{ former_user_instructions_str }}
    {% endif %}

    {% if task_output_format is not none %}

    # Output Format

    {% if not workflow_check %}

    {{ task_output_format }}

    {% else %}

    The task has two steps, but your response must adhere to the final output format given at the end.

    ## [Partial Response Format 1]
    ### Step 1: **Task Output Format**:
    {{ task_output_format }}

    ### Step 2: **Workflow Update**:
    Since components have dependencies, your second task is to update the workflow to reflect the changes made to the target component. Please also decide whether the workflow needs to be updated and provide a brief description of the change task.
    {{ component_desc }}

    ## [Partial Response Format 2] Your generated workflow description should be a simple text and the following agent will do the implementation. If you think the workflow should not be updated, just respond with "No update needed".

    Finally, your output must strictly adhere to the following JSON format.
    {
      "task_design": a dict which strictly adheres to the **Task Output Format** in Step 1,
      "workflow_update": "A string which is a precise and comprehensive description of the Workflow Update, or 'No update needed' if no changes are required."
    }
    {% endif %}
    {% else %}
    Please respond in JSON format.
    {% endif %}
    
  user: |-
    # Competition Scenario Description
    {{ scenario_desc }}

    # Data Folder Structure (All files are under {% include "scenarios.data_science.share:scen.input_path" %})
    {{ data_folder_info }}

    # Current SOTA Implementation & Feedback
    {{ sota_exp_desc }}

    # Proposed Hypothesis
    This sketch should implement the following hypotheses:

    {% for hypothesis in hypotheses %}
    ## {{ hypothesis.problem_name }}
    **Why:** {{ hypothesis.problem_desc }}
    **Hypothesis:** {{ hypothesis.hypothesis }}

    {% endfor %}
    # Previous Failed Experiments & Feedback (e.g., experiments that did not pass evaluation, encountered bugs, or failed to surpass SOTA performance)
    {{ failed_exp_and_feedback_list_desc }}
  
    {% if eda_improvement is not none %}
    {{ eda_improvement }}
    {% endif %}

idea_sample:
  system: |-
    You are a Kaggle Grandmaster and expert ML engineer with deep expertise in statistics, machine learning, and competition optimization.
    The user is improving a Kaggle competition implementation iteratively through traces, where each new trace is a modification of the current SOTA in the trace, not necessarily of its immediate predecessor.
    You will be given a competition scenario, previous SOTA and failed experiments with their feedback, and the current SOTA implementation with its feedback.
    The user has identified potential problems in the current SOTA implementation and sampled a few ideas as possible improvement directions for each problem.
    Your task is to identify the most useful and promising idea for each problem, judged by the impact, alignment, and novelty of the ideas.

    The user-provided ideas might not be suitable solutions for the identified problems. If none of the ideas for a problem is useful, omit that problem from your response dict.

    ### Specification
    {{ idea_spec }}

    ### Output Format
    {{ idea_output_format }}

  user: |-
    # Scenario Description
    {{ scenario_desc }}
    
    # Previous Experiments and Feedback
    {{ exp_feedback_list_desc }}    

    # Current SOTA Implementation
    {{ sota_exp_desc }}

    # Problem-Ideas Pairs
    {{ problem_ideas }}

specification:
  hypothesis: |-
    1. Each hypothesis should be specific and non-vague.
      - Avoid vague statements like "improve the model" or "optimize the pipeline." Instead, specify the exact changes to be made. Do not use ambiguous changes like "try method A or method B". 
      - No phrases like "for example" or "e.g.," should be used in the hypothesis. Give a clear decision in the hypothesis.
    2. Each hypothesis should be testable and actionable. It should clearly state the expected change or improvement in the component's performance. For example, "tuning a model" is too broad, whereas "increasing the learning rate to 0.1 in the LightGBM model will improve performance" is testable and actionable.
    3. Each hypothesis should be aligned with the current SOTA implementation. It should be a potential solution to the identified problem.
    4. All the changes in the hypothesis should be correlated and relevant to each other. Avoid proposing multiple independent ideas in a single hypothesis.
    {% if not pipeline %}5. Each hypothesis should focus on a single direction per experiment. Avoid proposing multiple possibilities within the same hypothesis, such as "this may work in case A or case B." Research and development can be approached at different levels (shallow or deep), but each experimental loop should validate only one specific idea.
    6. Each hypothesis should focus on one component. The components will be described in the evaluation stage.
    {% else %}5. The hypothesis should focus on the whole pipeline. If needed, the hypothesis may propose changes across multiple parts in the SOTA implementation.
    {% endif %}

  idea: |-
    1. Alignment: The idea should be aligned with the identified problem. It should be a potential solution to the problem.
    2. Novelty: The idea should be novel and not previously explored in the current SOTA implementation. Avoid ideas that have already been tried and failed.
    3. Impact: The idea should have the potential to significantly improve the current SOTA implementation. It should be a promising direction for further exploration.
    4. You should identify the most useful and promising idea for each problem. If none of the provided ideas is useful, omit that problem from your response dict.

output_format:
  problem: |-
    For each identified problem, you should strictly adhere to the following JSON schema. 
    Your final output should be a dict containing all the identified problems and nothing else.
    Respond with at most five problems (FEWER BUT BETTER), prioritizing the most valuable ones that have not been explored recently. Do not include problems that are irrelevant to improving the target metric.
    {
      "problem name 1 (name of the identified problem without anything else)": {
        "problem": "Description of the first issue in no more than three sentences.",
        "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials in no more than two sentences."
      },
      "problem name 2 (name of the identified problem without anything else)": {
        "problem": "Description of the second issue in no more than three sentences.",
        "reason": "Brief explanation of why this is a problem, based on the feedback or inferred from provided materials in no more than two sentences."
      }
    }
  hypothesis: |-
    For each identified problem, you should propose a hypothesis that strictly follows the JSON schema. Your final output should be a dict containing all the proposed hypotheses.
    {
      "problem name 1 (should be exactly same as the problem name provided)": {
        {% if enable_idea_pool %}"inspired": "True or False. Set to True if the hypothesis is inspired by the user provided ideas. Otherwise, set it to False.",{% endif %}
        "reason": "Provide a clear, logical progression from problem identification to hypothesis formulation, grounded in evidence (e.g., trace history, domain principles, or competition constraints). Refer to the Hypothesis Guidelines for better understanding. Reason should be short with no more than two sentences.",
        "component": "The component tag of the hypothesis. Must be one of ('DataLoadSpec', 'FeatureEng', 'Model', 'Ensemble', 'Workflow').",
        "hypothesis": "A concise, testable statement derived from previous experimental outcomes. Limit it to one or two sentences that clearly specify the expected change or improvement in the <component>'s performance.",
        "evaluation": {
          "alignment_score": "The alignment of the proposed hypothesis with the identified problem.",
          "impact_score": "The expected impact of the proposed hypothesis on the current SOTA implementation.",
          "novelty_score": "The novelty of the proposed hypothesis compared to existing solutions.",
          "feasibility_score": "The feasibility of implementing the proposed hypothesis in the current SOTA implementation.",
          "risk_reward_balance_score": "The risk-reward balance of implementing the proposed hypothesis.",
        }
      },
    }
  idea: |-
    For each problem, you should identify the most useful and promising idea, strictly following the JSON schema.
    Your final output should be a dict containing the problems paired with their selected ideas and nothing else.
    Respond with at most five problem-idea pairs, prioritizing the most valuable ones that have not been explored recently.
    {
      "problem name 1 (should be exactly same as the problem name provided)": 1, # The index of the selected idea. It must be an integer and match the idea index provided in the input.
      "problem name 2 (should be exactly same as the problem name provided)": 2, # The index of the selected idea. It must be an integer and match the idea index provided in the input.
    }

  critique: |-
    For each hypothesis, provide a comprehensive critique strictly following the JSON schema.
    Your final output should be a dict containing critiques for all hypotheses without anything else.
    {
      "critiques": {
        "problem name 1 (should match the hypothesis problem name exactly)": {
          "critique": "A comprehensive critique covering: (1) Technical feasibility and potential issues, (2) Alignment with the scenario and competition requirements, (3) Specific improvement suggestions, (4) Overall assessment of the hypothesis quality and implementability. Be constructive and actionable."
        },
        "problem name 2": {
          "critique": "..."
        }
      }
    }
  rewrite: |-
    For each original hypothesis, rewrite it to address critique feedback, strictly following the JSON schema below. 
    Your final output should be a dict containing all rewritten hypotheses without anything else.
    {
      "problem name 1 (should be exactly same as the original problem name without prefix or suffix)": {
        "reason": "Independent justification for why this hypothesis makes sense given the current scenario, dataset characteristics, and competition requirements. DO NOT reference critique feedback or suggestions. Should be short with no more than two sentences focusing on the fundamental problem context.",
        "component": "The component tag of the hypothesis. Must be one of ('DataLoadSpec', 'FeatureEng', 'Model', 'Ensemble', 'Workflow').",
        "hypothesis": "A concise, improved hypothesis statement that directly addresses critique concerns. Limit to one or two sentences that clearly specify the expected change or improvement. Should be more specific and actionable than the original.",
        {% if enable_scale_check %}"appendix": "A short sentence indicating whether the hypothesis is targeted for scaling or not. Give instructions to the following steps about implementing this hypothesis.", {% endif %}
        "evaluation": {
          "alignment_score": "Score from 1 (lowest/worst) to 10 (highest/best). How directly and effectively does the hypothesis address the core issues of the identified problem it targets? A higher score means a stronger, more direct alignment.",
          "impact_score": "Score from 1 (lowest/worst) to 10 (highest/best). What is the estimated magnitude of improvement (e.g., in the primary competition metric, efficiency, robustness, or successful execution) if this hypothesis is successfully implemented? Higher scores for greater positive impact.",
          "novelty_score": "Score from 1 (lowest/worst) to 10 (highest/best). How innovative or original is this hypothesis when compared to the approaches and ideas evident in the previous SOTA experiments and previous failed experiments? Assign a score of 1 if the hypothesis is a repeat or substantially similar to a previously attempted hypothesis (whether successful or failed), UNLESS the previous attempt clearly failed due to a trivial implementation bug and the current hypothesis proposes the correct implementation of the same core idea.",
          "feasibility_score": "Score from 1 (lowest/worst) to 10 (highest/best). How easily and practically can this hypothesis be implemented and run to completion within the existing SOTA codebase and operational constraints (e.g., allowed time for training/inference, available compute resources, overall complexity)? Higher scores for easier implementation and higher likelihood of successful execution.",
          "risk_reward_balance_score": "Score from 1 (lowest/worst) to 10 (highest/best). Considering the potential for significant improvement (reward) versus the probability of failure, negative side-effects, or excessive resource consumption (risk), how optimal is this balance? A high score indicates a favorable balance. If a hypothesis directly and credibly addresses a critical challenge that caused prior experiment failures (e.g., timeout, persistent data loading errors, incorrect submission format preventing any score), this should generally be scored highly (e.g., 8-10).",
        }
      }
    }

  hypothesis_select_format: |- 
    You must return a dictionary in the following format for the hypothesis:
    {
      "hypothesis": "...",  
      "component": "..."  // Must be one of: 'DataLoadSpec', 'FeatureEng', 'Model', 'Workflow', 'Ensemble'
    }

