UPDATE_INSTRUCTION_PROMPT = """
As a Kaggle Grandmaster competing in a challenge, your task is to suggest potential evolutionary improvements that could enhance the performance of the code.
Propose insights to help improve the performance of the model on this dataset.
The insights should be proposed based on the dataset description and given EDA. 
There are three stages in total: Exploratory data analysis, data cleaning and feature engineering, model training.
You are on the {current_task} stage.

* **Background Information about problem (`background_info`):**
{background_data}

Ideas for previous stages: 
{previous_ideas}

We collect the best ideas from Kaggle competitions and ArXiv papers. Based on these, you can generate new ideas — but be sure to introduce original concepts, not just rephrase existing ones. 
If there’s nothing listed here, it means the agent determined that retrieval-augmented generation (RAG) was not necessary in this case.
RAG context:
{rag_context}

Here is a memory of ideas that you generated earlier, collected by {memory_algorithm}. Use it to generate new ideas with greater diversity.
{memory_context}

When generating, you may build on some of the ideas from the context, but be sure to come up with your own ideas as well, and think carefully about whether each idea is suitable for our competition.

You must generate at least {number_of_ideas} insights.
Make sure each method is diverse enough and can be implemented separately.
Be specific about models' choices, ensemble and tuning techniques, and preprocessing & feature engineering techniques.

Your model choices must be as advanced as possible. For example, on the model training stage you should not generate the idea "train logistic regression" (because logistic regression is a very simple and inaccurate model), but rather ideas such as "train catboost with depth 10" or "train xgboost with depth 8".

Your insight must contain only one idea. Do not train ensembles and do not generate several features in one idea. For example, for the model training phase your generation must look like this: "train catboost with depth 10", and must not look like this: "Implement a stacking ensemble with catboost with depth 10 and xgboost with depth 8 and transformer model with 10 layers" (because it is a combination of ideas, and if some of them are not good for this competition, they can worsen the final score).

Do not use LightGBM or other models that do not train well on GPUs, unless they are statistical models or similar. All training should be completed in a short period of time. If you are training models, try to utilise all GPUs.

## EDA IMAGES ##
All images in your context were generated during the EDA phase.


## EDA OUTPUT ##
{eda_output}

## Remember that each idea should take no longer than 60 minutes to implement and run. You should not come up with a huge number of hyperparameters or train huge models.

# Format
```json
{{
    "insights": [
        "insight1",
        ...
        "insight{number_of_ideas}"
    ]
}}
```
"""

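# Illustrative sketch, not part of the original pipeline: one way a reply to the
# "insights" prompts in this module might be parsed. `parse_insights` is a
# hypothetical helper name. It assumes the model may wrap the JSON object in
# extra text (a code fence, a trailing comma), so it decodes only the span
# between the first "{" and the last "}".
import json


def parse_insights(reply: str) -> list[str]:
    """Return the "insights" list from a model reply, or raise ValueError."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    payload = json.loads(reply[start:end + 1])
    insights = payload.get("insights")
    if not isinstance(insights, list) or not all(isinstance(i, str) for i in insights):
        raise ValueError('"insights" must be a list of strings')
    return insights
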
GENERATE_IDEAS_FOR_COMPLEX_TRAINING = """
As a Kaggle Grandmaster, your primary task is to propose diverse **complex training strategies** for machine learning and neural network models. These strategies should focus *strictly on core model training aspects*, including model class choice and specific model architecture/implementation. **Do not include explicit hyperparameter tuning or complex ensemble techniques like stacking or blending.** If necessary for robustness, a simple bagging approach for a single model class representative is acceptable, but the focus should remain on the fundamental behavior of the model class. **The emphasis is on model training, not data preprocessing or feature engineering.**
The goal is to generate a set of these diverse training strategies, each representing a distinct approach to model training (e.g., one strategy focusing on linear models, another on tree-based models, a third on neural networks, etc.). This diversity is crucial for understanding how different algorithmic classes perform.

* **Background Information about problem (`background_info`):**
{background_data}


---

### Task 1: Generate Diverse Complex Training Strategies

For each strategy, provide a **detailed, actionable description** of what needs to be done. This description should be comprehensive enough that a data scientist could directly implement it. Specifically, for each strategy, include:

* **Strategy Name**: A concise, descriptive name for the strategy.
* **Primary Model Class**: The main type of algorithm used (e.g., Boosting, Neural Networks, Linear Models, Support Vector Machines, K-Nearest Neighbors).
* **Specific Model(s)**: Exact model names or architectural descriptions within the chosen class (e.g., LightGBM, XGBoost, CatBoost, simple Multi-Layer Perceptron (MLP) with X layers and Y neurons per layer, Logistic Regression, basic SVM).
* **Key Training Aspects**: Describe the core training setup. This includes:
    * **Initialization**: How models are initialized (e.g., default initialization, specific random seeds).
    * **Training Process**: How the model is trained (e.g., number of epochs/iterations, use of a simple validation split if required for early stopping, but *without extensive hyperparameter sweeps*).
    * **Computational Considerations**: Any specific hardware or library considerations (e.g., `task_type='GPU'`, multi-threading).
    * **Approach to Multiple Models (if any)**: If more than one model is used within a strategy, describe why (e.g., "Bagging 5 instances of LightGBM to assess average performance of tree ensembles", "Training 3 simple MLPs with different random seeds for robustness check"). **Avoid complex ensemble methods.**

Each strategy must be implementable independently and quickly (no more than {max_minutes_to_run} minutes for training per strategy).

You must generate at least {number_of_ideas_min} and no more than {number_of_ideas_max} distinct complex training strategies.

---

### Task 2: Select a "Reference" Strategy for Feature Engineering Comparison

From the generated diverse strategies, choose **one single complex training strategy** that will serve as a "reference" point. This chosen strategy will be re-evaluated on datasets with different feature engineering to observe its performance change. This will help infer how feature engineering impacts a specific algorithmic class, which can then be used to predict the performance of other model classes.

**Criteria for selection**: Choose a strategy that is robust and representative enough to provide meaningful insights when its performance is compared across different feature sets. A good candidate would be a strategy based on a commonly used and relatively stable model class.

---

## Context for Strategy Generation

You are provided with:
* **Dataset Description**: Information about the dataset.
* **EDA Output**: Results from the Exploratory Data Analysis phase, which can help in understanding the data's characteristics but should *not* be used to propose data preprocessing or feature engineering techniques. Use this to inform your model choices and potential training complexities within the specified limits.

## EDA OUTPUT ##
{eda_output}

# Format
This format must be strictly adhered to. The output must be a valid JSON object with the following top-level keys:

"complex_training_strategies": An array of strategy objects. Each object must contain "name", "primary_model_class", and a "description" key. The "description" key must be a string detailing the strategy.

"reference_strategy_for_feature_comparison": A single strategy object, mirroring the structure of those in the array, and must include its "description".

```json
{{
    "complex_training_strategies": [
        {{
            "name": "Strategy 1: Single LightGBM Model",
            "primary_model_class": "Boosting",
            "description": "Train a single LightGBM model. Use default parameters for boosting, focusing on a fixed number of boosting rounds (e.g., 1000 iterations). Implement a simple early stopping mechanism using a 10% validation split to prevent overfitting, with patience of 50 rounds. Leverage GPU acceleration if available by setting `device_type='gpu'` to ensure fast execution."
        }},
        {{
            "name": "Strategy 2: Simple Multi-Layer Perceptron (MLP)",
            "primary_model_class": "Neural Networks",
            "description": "Implement a basic Multi-Layer Perceptron (MLP) with two hidden layers. The first hidden layer should have 128 neurons, and the second 64 neurons, both using ReLU activation. Train for a fixed number of epochs (e.g., 50 epochs) with a batch size of 32 using the Adam optimizer. Initialize weights using Kaiming Uniform. Use a simple 10% validation split to monitor loss during training."
        }},
        ...
    ],
    "reference_strategy_for_feature_comparison": {{
        "name": "Strategy X: [Name of the chosen reference strategy]",
        "primary_model_class": "[Class of the chosen strategy]",
        "description": "[Full description of the chosen reference strategy]"
    }}
}}
```
"""

EDA_PROMPT = """
As a Kaggle Grandmaster competing in a challenge, your task is to suggest potential evolutionary improvements that could enhance the performance of the code.
Propose insights to help improve the performance of the model on this dataset.
The insights should be proposed based on the dataset description. 
There are three stages in total: Exploratory data analysis, data cleaning and feature engineering, model training.
You are on the Exploratory data analysis stage.

--- **Background Information about problem:** ---
{background_data}
---

You must generate as many insights as possible.
Make sure each method is diverse enough and can be implemented separately.

Your insights should be as original as possible and show information that could potentially maximize the metric.

# Format
```json
{{
    "insights": [
        "insight1",
        ...
        "insight{number_of_ideas}"
    ]
}}
```
"""

ADD_INSIGHTS_PROMPT = """
You are an AI assistant tasked with analyzing a machine learning solution and proposing new insights to improve its performance. Given the current solution code and development score, suggest innovative approaches to enhance the model.

I'll give you all the approaches I've tried so far, along with additional information about the competition and the EDA output. When you come up with your insights, try to make them as different as possible from the ones I have already implemented.

## COMPETITION INFO ##
{background_info}

## CURRENT SOLUTION CODE ##
{solution_code}

## SOLUTION SCORE ##
{score}

## PREVIOUS APPROACHES ##
{previous_approaches}

## EDA IMAGES ##
All images in your context were generated during the EDA phase.

## EDA OUTPUT ##
{eda_output}

Based on this information, propose 1-{max_add_idea} new insights for {current_task}. Your insights should be specific, actionable, and have the potential to improve the model's performance.

When generating ideas, consider my device configuration so that each idea can run in a reasonable time without running out of memory.

## DEVICE CONFIG ##
{device_info}

Please format your response as a JSON object with the following structure:
{{
    "insights": [
        "insight1",
        "insight2",
        ...
    ]
}}
"""

MERGE_INSIGHTS_PROMPT = """
You are an AI assistant tasked with analyzing a machine learning solution and proposing insights for merging two previous ideas to improve its performance. Given the code of the first and second ideas and the current task, suggest innovative approaches to enhance the model.

First idea code (score {score1}): 
{idea1}

Second idea code (score {score2}):
{idea2}


Please format your response as a JSON object with the following structure:
{{
    "insights": [
        "merge_idea"
    ]
}}
"""

ADD_INSIGHTS_NO_PARENT_PHASE_PROMPT = """
You are an intelligent assistant helping a data scientist during the "{task_name}" phase of a machine learning pipeline.
Below are the competition information and the output from the EDA phase, which includes key statistics, distributions, anomalies, and other observations:

--- COMPETITION INFO START ---
{background_info}
--- COMPETITION INFO END ---

--- EDA OUTPUT START ---
{eda_output}
--- EDA OUTPUT END ---

Additionally, the following ideas have already been explored during this phase, along with their corresponding evaluation scores:

--- IDEAS AND SCORES START ---
{all_ideas_and_scores}
--- IDEAS AND SCORES END ---

You also have access to relevant EDA visualizations (not included in the text above).

A new parameter, "group_name", is provided. If "group_name" is None, you may generate ideas arbitrarily. If "group_name" is specified, all generated ideas must be thematically related to that group.

Your task is to reflect on all this information and generate new, diverse, and insightful ideas to try during the "{task_name}" phase. Do not repeat existing ideas. Try to build upon or go in new directions based on the data, analysis so far, and the specified group context.

Based on this information and the value of "group_name": {group_name}, propose 1-{max_add_idea} new insights for {task_name}.

Please format your response as a JSON object with the following structure:
{{
    "insights": [
        "insight1",
        "insight2",
        ...
    ]
}}
"""

SPLIT_IDEAS_INTO_GROUPS = """You are an advanced data-processing assistant. Your goal is to group a set of feature-processing ideas into semantically meaningful categories.

Input ideas:
{ideas}

Instructions:

1. Read each idea, focusing on the feature-processing step it represents.
2. Invent clear, concise English names for each group—for example, more specific labels such as "One-Hot Encoding for Categorical Variables", "Statistical Outlier Profiling", "PCA for Dimensionality Reduction", or "Missing Value Imputation with Median", rather than broad terms like "Feature Engineering".
3. Use only integer values to refer to nodes (e.g., 0, 1, 2), not strings with prefixes. For instance, refer to node indexes simply as 0, 1, 2.
4. Assign each idea to exactly one group based on its semantic role.
5. In the "thoughts" field, explain your internal reasoning about why you chose these groups and assignments.
6. Output only a JSON object in this exact format (this is an example):
{{
   "thoughts": "Your internal reasoning and thought process here",
   "Name of the first group": [
        7,
        1
   ],
   "Name of the second group": [
        9
   ],
   ...
}}

Make sure group names accurately summarize the specific processing step each contains."""
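
# Illustrative sketch, not part of the original pipeline: a check that a decoded
# reply to SPLIT_IDEAS_INTO_GROUPS follows instructions 3 and 4 (integer node
# indexes, each idea in exactly one group). `validate_grouping` and its
# parameters are hypothetical names; `idea_ids` is assumed to be the set of
# node indexes that were listed in the prompt.
def validate_grouping(response: dict, idea_ids: set[int]) -> dict[str, list[int]]:
    """Return the group mapping without "thoughts", or raise ValueError."""
    groups = {name: ids for name, ids in response.items() if name != "thoughts"}
    seen: list[int] = []
    for name, ids in groups.items():
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(ids, list) or not all(type(i) is int for i in ids):
            raise ValueError(f"group {name!r} must be a list of integers")
        seen.extend(ids)
    if len(seen) != len(set(seen)):
        raise ValueError("an idea is assigned to more than one group")
    if set(seen) != idea_ids:
        raise ValueError("assigned indexes do not match the input ideas")
    return groups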

ADD_IDEAS_TO_GROUPS = """
- group_names:
{group_names}

- ideas_without_group:
{ideas_without_group}

Instructions:

You are an expert LLM agent focused on the feature-engineering stage of a project. 
Your goal is to take each idea without a group and assign it to one of the existing groups, or to a new group if none of the current ones fit. 
Think step by step, showing your internal reasoning in "thoughts", then output only a JSON object.

Output format (strict JSON):
{{
  "thoughts": "<Your internal reasoning here>",
  "Name of the first group": [<list of integer indexes>],
  "Name of the second group": [<list of integer indexes>],
  ...
}}

If you believe any ideas warrant a new group, feel free to invent a new group name not listed in group_names and assign indexes to it.
"""

SELECT_GROUP = """You are an expert research assistant. Your task is to choose a single group from a set of candidate groups for further idea exploration, or to propose a brand‑new group if that would be more fruitful—unless the task_name is "Model training", in which case you must select an existing group.

Parameters:
- task_name = {task_name}
  (the name of the current task, e.g., "Model training" or "Data preparation and feature engineering")

Input groups:
{groups}

Instructions:
1. Review the statistics for each group:
   - number of ideas in the group: number of ideas in the group
   - number_of_inputs: total number of idea submissions in this group  
   - max_score: highest evaluation score among the group's ideas  
   - mean_score: average evaluation score  
   - median_score: median evaluation score  
   - std_score: standard deviation of the evaluation scores  

2. Consider both exploitation (groups with strong past performance) and exploration (under‑explored groups with potential). Do not always choose the highest‑scoring group—balance proven success against new opportunities.

3. Apply the rule for creating new groups:
   - If task_name == "Model training":  
     • You **must** select one of the existing groups (do **not** create a new group).  
   - Otherwise:  
     • If none of the existing groups seem optimal, you may create a new group that better captures promising ideas or themes.

4. Select exactly one group (existing or newly created, per the rule above) that you believe is best suited for additional research.

5. Output only a JSON object in this exact format:
{{
    "thoughts": "Your internal reasoning and thought process here",
    "select_group": "the name of the chosen or newly created group"
}}

Do not include any additional fields or commentary."""

SELECT_NODES = """You are an expert idea selector specializing in feature-engineering concepts that will serve as foundational ideas for the next modeling phase.

Input ideas (feature engineering entries):
{all_ideas_and_scores}

Additional parameter:
number_of_selected_node: {number_of_selected_node}

Instructions:
1. Parse each feature-engineering idea entry, noting:
   - index: the idea’s identifier
   - mean_score: average evaluation score of the idea
   - number_of_inputs: how many times this idea has been proposed
2. Remember that these are feature-engineering proposals intended to feed into the modeling stage.
3. Balance exploitation (ideas with high mean_score) and exploration (ideas with fewer inputs that may uncover novel modeling opportunities).
4. Select exactly number_of_selected_node ideas that together optimize proven quality and potential for new discoveries in modeling.
5. In the "thoughts" field, explain your internal reasoning and criteria for selection.
6. Output only a JSON object in this exact format:
{{
    "thoughts": "Your internal reasoning and thought process here",
    "select_nodes": [list of selected idea indexes] <- list[int]
}}

Do not include any additional fields or commentary."""
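
# Illustrative sketch, not part of the original pipeline: a check that a decoded
# reply to SELECT_NODES contains exactly the requested number of distinct, known
# idea indexes. `validate_selected_nodes`, `valid_ids` and `k` are hypothetical
# names; `k` stands for the value passed as number_of_selected_node.
def validate_selected_nodes(response: dict, valid_ids: set[int], k: int) -> list[int]:
    """Return the "select_nodes" list, or raise ValueError if it is malformed."""
    nodes = response.get("select_nodes")
    if not isinstance(nodes, list) or not all(type(n) is int for n in nodes):
        raise ValueError('"select_nodes" must be a list of integers')
    if len(nodes) != k or len(set(nodes)) != k:
        raise ValueError(f"expected exactly {k} distinct indexes")
    if not set(nodes) <= valid_ids:
        raise ValueError("reply selects an index that was not in the input")
    return nodes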