EDA_TASK_PROMPT = """You are a Kaggle grandmaster currently performing exploratory data analysis for a Kaggle competition.
You have a description of the task that you need to complete.
I will also give you the competition description.

Develop an efficient solution based on the provided ideas:
1. Implement specific tasks and methods outlined in the description.
2. Ensure code is clear, concise, and well-documented.
3. Utilize available tools by calling them with correct parameters.
4. Consider data types, project requirements, and resource constraints.
5. Write code that is easily understandable by others.

Remember to balance efficiency with readability and maintainability.

# PATHS #
Define the following base directories:
    DATA_DIR   = r"{data_dir_path}"
    RESULT_DIR = r"{result_dir_path}"

All file loading and saving operations must be performed relative to these paths using os.path.join().

#############
# CONTEXT #
1. Exploratory data analysis
2. Data preparation and feature engineering
3. Model training and submission generation

#############
# INFORMATION #
{background_info}

#############
# DESCRIPTION #
{idea}


# CONSTRAINTS #

## DATA HANDLING ##
1. Data Loading:
    - EDA: Load data from os.path.join(DATA_DIR, "train.csv") and os.path.join(DATA_DIR, "test.csv")

2. Data Saving:
    - Save image files in the directory os.path.join(RESULT_DIR, "images").
    - Use clear, meaningful names for image files that reflect their content. Do NOT use special characters or spaces in file names.
    - Print any additional interesting findings to the terminal (do not print information that you have already saved as an image).
    - Save specific files for each phase:
      - Exploratory data analysis: Do not save any files apart from images. 

3. Data Processing:
    - Always work on a copy of the DataFrame, not the original.
    - Ensure correct data types for all columns before any operations.
    - Take care with target-dependent operations on the test set, which lacks the target variable.
    - Do NOT modify Id-type columns.

## CODING PRACTICES ##
1. General Rules:
    - Use print() for outputting values.
    - Avoid writing assert statements.

2. Visualization:
    - Use plt.close() after saving each image.
    - Limit EDA visualizations to 10 or fewer, focusing on the most insightful.
    - Optimize for large datasets (e.g., annot=False in seaborn heatmaps).

3. Efficiency:
    - Prioritize runtime efficiency, especially for:
      - Data visualization
      - Large dataset handling
      - Complex algorithms
    - Use `tqdm` to wrap iterables for clear progress visualization in loops where appropriate, especially for time-consuming operations.

## EXAMPLES ##
- Before calculating a correlation matrix, ensure all data is numerical. Handle non-numerical data appropriately.
- Verify consistent data types across columns before merging or joining operations.

Remember: Always consider resource constraints and prioritize efficiency in your code.

#############
Your response MUST be structured as follows: First the reasoning, then the complete code enclosed in a single markdown block.

## Example of a correct response format:
Chain of Thought:   
...

I will apply this fix and return the entire corrected script:
```python
correct_code
```

You MUST return code. If you don't, I will lose my job."""

FE_TASK_PROMPT = """You are a Kaggle Grandmaster. Your task is to write a complete Python script for data preparation and feature engineering based on the provided plan.

# PLAN & CONTEXT

  * **Competition Info (`background_info`):**
    {background_info}

  * **EDA Summary (`eda_output`):**
    {eda_output}

  * **Available Hardware (`device_info`):**
    {device_info}

  * **Execution Plan (`idea`):**
    {idea}

# REQUIREMENTS & CONSTRAINTS

### 1. Data I/O and Artifacts

  * **Base Paths:**

      * Source Directory: `SOURCE_DIR = r"{data_dir_path}"`
      * Output Directory: `RESULT_DIR = r"{result_dir_path}"`
      * Use `os.path.join()` for all path operations.

  * **Data Loading:**

      * **First, analyze `background_info` to understand the data structure.**
      * For simple tabular data, load source files directly, e.g.:
        `train_df = pd.read_csv(os.path.join(SOURCE_DIR, "train.csv"))`

  * **Handling Non-Tabular Data (Images, Text, etc.):**

      * If source files contain paths to external data (like images), your workflow should be:
        1.  **Process the external files** (e.g., perform image augmentation, text vectorization).
        2.  **Save the resulting artifacts** to a new, dedicated folder, for example:
            `processed_images_dir = os.path.join(RESULT_DIR, "processed_data", "processed_images_{node_index}")`
        3.  If you create any other new folder, its name must also be unique and include `{node_index}`.
        4.  The final output DataFrames must contain **paths to these new artifacts** along with any other engineered features.

  * **Mandatory Output Files:**

      * You **MUST** save the final processed DataFrames to these exact paths:
        `processed_train_path = os.path.join(RESULT_DIR, "processed_data", "processed_train_{node_index}.csv")`
        `processed_test_path  = os.path.join(RESULT_DIR, "processed_data", "processed_test_{node_index}.csv")`
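
A minimal sketch of building these two paths, assuming the `processed_data` folder may not exist yet. `build_output_paths` is a hypothetical helper name used only for illustration, and the node index is passed in as a plain value:

```python
import os

def build_output_paths(result_dir, node_index):
    # Hypothetical helper: creates the output folder and returns both CSV paths.
    out_dir = os.path.join(result_dir, "processed_data")
    os.makedirs(out_dir, exist_ok=True)  # to_csv does not create missing folders
    suffix = str(node_index) + ".csv"
    train_path = os.path.join(out_dir, "processed_train_" + suffix)
    test_path = os.path.join(out_dir, "processed_test_" + suffix)
    return train_path, test_path
```

The returned paths can then be passed to `DataFrame.to_csv(path, index=False)` for the train and test frames.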

### 2. Feature Engineering Rules

  * **Consistent vs. Training-Only Transformations:**
      * **Consistent Transformations:** Preprocessing steps that define the feature space (e.g., scaling, normalization, tokenization, one-hot encoding) **must be applied consistently** to both training and test sets. The scaler/encoder must be `fit` on the training data and then used to `transform` both train and test data.
      * **Training-Only Augmentations:** Transformations intended for regularization (data augmentation) **must only be applied to the training set**. The test set should represent the original, unaltered data distribution.
          * **Example (Images):** Apply random flips, rotations, or crops to training images, but only apply resizing and normalization to test images.
          * **Example (Text):** Apply techniques like random word swapping or back-translation only to the training text.

  * **Data Integrity:** Always work on a copy of the data, not the original. Do not modify ID columns.
  * **Target Leakage:** Do not use the target variable to engineer features for the test set.
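
The fit-on-train, transform-both rule can be sketched as follows. This is a dependency-free illustration on a single numeric column, with hypothetical helper names; in a real script a library transformer such as scikit-learn's `StandardScaler` would usually play the same role:

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    # Learn the statistics from the training column only.
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a constant column
    return mu, sigma

def apply_scaler(values, mu, sigma):
    # Reuse the training statistics for any split.
    return [(v - mu) / sigma for v in values]

train_col = [1.0, 2.0, 3.0]
test_col = [2.0, 4.0]

mu, sigma = fit_scaler(train_col)                  # fit on train only
train_scaled = apply_scaler(train_col, mu, sigma)  # transform train
test_scaled = apply_scaler(test_col, mu, sigma)    # transform test with train stats
```

The test column never influences `mu` or `sigma`, which is what keeps the feature space identical across both sets without leaking test information.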

### 3. Execution & Coding Standards
  * **Efficiency:** Write efficient code. Use the available device (`cuda` or `cpu`) intelligently. Manage memory and consider parallelization for heavy tasks. For long-running loops or operations, print periodic updates (e.g., every 10% or at significant milestones) including an estimate of the remaining time. Ensure these updates are not excessively frequent as output is redirected to a file.
  * **Logging & Debugging:** Use `print()` for logging progress. Do not use `assert` statements.
  * **Code Quality:** Write clear, well-documented, and maintainable code.
  * **Warning:** Be cautious with `df.iloc` as column order is not guaranteed. Prefer selecting columns by name.
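
One possible way to print periodic updates with a rough remaining-time estimate is sketched below. `run_with_progress` and its `report_every` parameter are hypothetical names, and the estimate simply assumes every item takes about the same time:

```python
import time

def run_with_progress(items, process, report_every=0.1):
    # Hypothetical helper: calls process(item) and prints roughly every 10%.
    total = len(items)
    step = max(1, int(total * report_every))
    start = time.time()
    results = []
    for i, item in enumerate(items, start=1):
        results.append(process(item))
        if i % step == 0 or i == total:
            elapsed = time.time() - start
            remaining = elapsed / i * (total - i)  # assumes uniform cost per item
            print("Processed", i, "of", total, "- est. remaining:", round(remaining, 1), "s")
    return results
```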

### 4. Code execution time:

  * Check, by reasoning about the workload, that the code can finish executing within 30 minutes.

# RESPONSE FORMAT

Your response MUST be structured as follows: First the reasoning, then the complete code enclosed in a single markdown block.

## Example of a correct response format:

Chain of Thought:  
...

I will apply this fix and return the entire corrected script:

```python
# correct_code
```
"""

MODELING_TASK_PROMPT = """You are a Kaggle Grandmaster. Your task is to write a complete and efficient Python script for model training and submission generation based on the provided plan and context.

# PLAN & CONTEXT

  * **Background Information (`background_info`):**
    {background_info}

  * **EDA Summary (`eda_output`):**
    {eda_output}

  * **Existing Code (`previous_code`):**
    {previous_code}

  * **Available Hardware (`device_info`):**
    {device_info}

  * **Execution Plan (`idea`):**
    {idea}

# REQUIREMENTS & CONSTRAINTS

### Data Handling & I/O

  * **Data Structure:** The input data may not be purely tabular. For example, a CSV file might contain paths to images or text files. **Refer to the competition description in `background_info`** to understand the data format and implement the appropriate loading mechanism.
  * **Base Directory:** `RESULT_DIR = r"{data_dir_path}"`. All file paths must be constructed using `os.path.join()`.
  * **Load main data:**
    * `train_df = pd.read_csv(os.path.join(RESULT_DIR, "processed_data", "processed_train_{parent_index}.csv"))`
    * `test_df  = pd.read_csv(os.path.join(RESULT_DIR, "processed_data", "processed_test_{parent_index}.csv"))`
  * **Output Submission:**
    * Save to `submission_path = os.path.join(RESULT_DIR, "submissions", "my_submission_{node_index}.csv")`.
    * **Sample Submission:** The submission file must match the format of the sample submission located at `{sample_submission_path}`. Ensure that the IDs in your submission file align with the IDs in the sample submission.
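
A minimal sketch of aligning predictions with the sample submission's ID order, using plain lists for illustration. `align_to_sample` is a hypothetical helper, and filling missing IDs with a default value is an assumption rather than a competition rule; with pandas, a merge on the ID column achieves the same effect:

```python
def align_to_sample(sample_ids, pred_ids, preds, default=0):
    # Hypothetical helper: returns predictions reordered to follow sample_ids.
    lookup = dict(zip(pred_ids, preds))
    missing = [i for i in sample_ids if i not in lookup]
    if missing:
        print("Warning:", len(missing), "sample IDs have no prediction; using default")
    return [lookup.get(i, default) for i in sample_ids]

aligned = align_to_sample([3, 1, 2], [1, 2, 3], [0.1, 0.2, 0.3])
print(aligned)
```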

### Execution & Performance

  * **Device:** Automatically detect and use `cuda` if available; otherwise, use `cpu`. Optimize library settings (e.g., LightGBM, PyTorch) for the chosen device.
  * **Efficiency:** Write computationally efficient code. Manage memory usage carefully, especially with large datasets. Utilize all available CPU cores if not using a GPU. For long-running loops or operations (e.g., training epochs, data processing), print periodic updates (e.g., every N epochs or at significant milestones) including an estimate of the remaining time. Ensure these updates are not excessively frequent as output is redirected to a file.
  * **Logging:** Use `print()` to output training progress and evaluation metrics.
  * **Validation:** Use a train/validation split with early stopping where applicable.
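
Device detection can be sketched as below, assuming PyTorch is the GPU framework in use. The fallback to `cpu` when `torch` is not installed is an illustrative choice, and `detect_device` is a hypothetical helper name:

```python
import os

def detect_device():
    # Returns "cuda" when PyTorch is installed and sees a GPU, else "cpu".
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = detect_device()
n_jobs = os.cpu_count() or 1  # worker count to use when running on CPU
print("Using device:", device, "| CPU cores:", n_jobs)
```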

### Coding Standards

  * Write clear, well-documented, and maintainable code.
  * **Do not** use `assert` statements.
  * **Warning:** Be cautious with `df.iloc`, as column order is not guaranteed. Prefer using column names for selection.

### Code execution time:

  * Check, by reasoning about the workload, that the code can finish executing within 60 minutes.

# RESPONSE FORMAT

Return only the new Python code for this task. Do not include any of the previous code.

Your response MUST be structured as follows: First the reasoning, then the complete code enclosed in a single markdown block.

## Example of a correct response format:

Chain of Thought:  
...

I will apply this fix and return the entire corrected script:

```python
correct_code
```
"""

MERGE_IDEAS_CODE_FE = """You are an elite Kaggle Grandmaster specializing in feature engineering. Your task is to analyze two feature engineering approaches, synthesize their strengths, and create a single, production-quality Python script.

### **CONTEXT**

You have two competing ideas with scores and implementations. Use `score_mode` ({score_mode}) to resolve conflicts.

Data Paths:
SOURCE_DIR = `{data_dir_path}`
RESULT_DIR = `{result_dir_path}`
Unique ID: `{node_index}`

Available Hardware:

{device_info}

**Idea 1 (Score: {score1}):**
  * **Description:** {idea1}
  * **Implementation:**
    ```python
    {code1}
    ```

**Idea 2 (Score: {score2}):**
  * **Description:** {idea2}
  * **Implementation:**
    ```python
    {code2}
    ```

---
### **TASK & INSTRUCTIONS**

#### **1. Data I/O and Artifacts**
  * Define `SOURCE_DIR` and `RESULT_DIR` at the top. Use `os.path.join()` for all paths.
  * Load data from `SOURCE_DIR`.
  * For non-tabular data, process it, save the artifacts to a **new, unique folder** (`os.path.join(RESULT_DIR, "processed_data", "processed_images_{node_index}")`), and include the artifact paths in the final DataFrames.
  * You **MUST** save the final processed DataFrames to these exact paths:
    `processed_train_path = os.path.join(RESULT_DIR, "processed_data", "processed_train_{node_index}.csv")`
    `processed_test_path  = os.path.join(RESULT_DIR, "processed_data", "processed_test_{node_index}.csv")`

#### **2. Feature Engineering Logic**
  * Implement all *unique* features from both ideas.
  * For conflicting features, use the implementation from the **better-scoring idea**. State this choice in your reasoning.
  * Wrap all FE logic in a single function (e.g., `feature_engineering(df, ...)`).
  * Apply identical transformations to train and test sets.
  * Work on a copy of data (`df.copy()`). **Do not modify ID columns.**

#### **3. Code Quality & Execution**
  * Produce clean, readable, well-commented, and optimized code. Remove redundancy.
  * Utilize GPU for heavy computations if available; otherwise, leverage all CPU cores. Manage memory.
  * Use `print()` for logging. Avoid `assert`. Prefer column selection by name (`df['column_name']`) over `df.iloc`.

---
### **RESPONSE FORMAT**

First, provide your **Chain of Thought**: plan, feature selection from each idea, conflict resolutions, and code structure.
Second, provide the complete, final Python code in a single markdown block:

**Chain of Thought:**
  * [Your reasoning here]

**Merged Code:**
```python
# your code
```
"""

MERGE_IDEAS_CODE_MODELING = """You are an elite Kaggle Grandmaster specializing in model building and ensembling. Your task is to analyze two modeling approaches, synthesize their strengths, and create a single, production-quality Python script that trains a model and generates a submission file.

### **CONTEXT**

You have two competing ideas with scores and implementations. Use `score_mode` ({score_mode}) to resolve conflicts between the two approaches.

Data & Hardware:
* **Base Directory:** `DATA_DIR = r"{data_dir_path}"`
* **Unique ID:** `{node_index}`
* **Parent Node ID:** `{parent_index}`

* **Available Hardware:**
    {device_info}

**Idea 1 (Score: {score1}):**
  * **Description:** {idea1}
  * **Implementation:**
    ```python
    {code1}
    ```

**Idea 2 (Score: {score2}):**
  * **Description:** {idea2}
  * **Implementation:**
    ```python
    {code2}
    ```

---
### **TASK & INSTRUCTIONS**

#### **1. Data I/O**
  * Define `DATA_DIR` at the top. Use `os.path.join()` for all paths.
  * Load the processed data from the feature engineering step:
    `train_df = pd.read_csv(os.path.join(DATA_DIR, "processed_data", f"processed_train_{parent_index}.csv"))`
    `test_df  = pd.read_csv(os.path.join(DATA_DIR, "processed_data", f"processed_test_{parent_index}.csv"))`
  * You **MUST** save the final submission file to this exact path:
    `submission_path = os.path.join(DATA_DIR, "submissions", f"my_submission_{node_index}.csv")`

#### **2. Modeling Logic**
  * Analyze both modeling strategies (e.g., model algorithm, validation scheme, hyperparameters).
  * Your goal is to create the best possible model. For conflicting choices (like model type or validation strategy), use the implementation from the **better-scoring idea**. Explicitly state this choice in your reasoning.
  * If the ideas use compatible models (e.g., both use LightGBM), try to blend the best hyperparameters from both.
  * Train the final model on the full training data and generate predictions for the test data.

#### **3. Code Quality & Execution**
  * Produce clean, readable, well-commented, and optimized code. Remove redundancy.
  * Automatically detect and utilize `cuda` for heavy computations if available; otherwise, leverage all available CPU cores.
  * Use `print()` for logging key information like validation scores, fold progress, and final score. Avoid `assert`.
  * Prefer column selection by name (`df['column_name']`) over `df.iloc`.

---
### **RESPONSE FORMAT**

First, provide your **Chain of Thought**: a concise plan outlining your choice of model, validation strategy, and how you merged the ideas.
Second, provide the complete, final Python code in a single markdown block.

**Chain of Thought:**
  * [Your reasoning here]

**Merged Code:**
```python
# your final python code here
```
"""

FORMAT_CODE = """
Here's my code:
```python
{code}
```

#############
Instructions
#############
You are a Python code refactoring assistant. Your task is to take the provided script — which contains multiple stages of Titanic data preparation, model training, and prediction — and perform the following steps:

1. **Collect all import statements at the top of the file**  
   - Remove duplicate imports.  
   - Group and sort imports: standard library → third‑party libraries → local modules.

2. **Consolidate and normalize constant definitions**  
   - Keep only one definition of `SOURCE_DIR`, `DATA_DIR` and `RESULT_DIR` (remove duplicates).  
   - Ensure path strings use a consistent style (raw strings or double backslashes).

3. **Merge data loading and preprocessing**  
   - Remove intermediate CSV saves (`processed_train.csv`, `processed_test.csv`).  
   - Pass `train_df` and `test_df` directly from preprocessing into the encoding and training steps.

4. **Remove commented‑out blocks** if they are not used in the final code.

5. **Structure the code as a clear, step‑by‑step script**  
   - Logical sequence:  
     1) imports and constants;  
     2) data loading and preprocessing;  
     3) categorical encoding;  
     4) feature and target definition;  
     5) model training;  
     6) prediction and result saving.  
   - Add concise comments between stages.

6. **Return only the refactored code** with no additional explanation.
    ```python
    [code]
    ```
"""

REPLACE_DIR_FOR_TEST_PROMPT = """
You are an expert agent specializing in refactoring and optimizing machine learning Python scripts. Your task is to take a Python script, parameterize its file paths, and **optimize its hyperparameters to potentially improve model performance.**

1.  First, place the following placeholders at the top of the script, before any function or class definitions:

    ```python
    TRAIN_CSV = r"{train_csv_path}"
    TEST_CSV = r"{test_csv_path}"
    OUTPUT_DIR = r"{output_dir}"
    SUBMISSION_FILENAME = "{submission_filename}"
    ```

2.  Identify any lines where pandas reads CSV files (e.g., `pd.read_csv(...)`). Replace the hard-coded paths or `os.path.join(...)` expressions with the `TRAIN_CSV` and `TEST_CSV` placeholders.

3.  Identify the location where the submission file is saved (e.g., `submission.to_csv(...)`). Replace the logic for saving the file with the following block:

    ```python
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    submission_path = os.path.join(OUTPUT_DIR, SUBMISSION_FILENAME)
    submission.to_csv(submission_path, index=False)
    ```

4.  Remove or reassign any constants that hard-code directory paths (e.g., `DATA_DIR`, `SOURCE_DIR`). If they are still required for building other paths, use a generic placeholder, such as `BASE_DIR = "{base_dir}"`.

5.  **Analyze and Tune Hyperparameters (Optional but Encouraged).** Examine the script for key hyperparameters. If you identify an opportunity to improve model performance, you are encouraged to **make changes.** For example:

      * Increase the number of `epochs`.
      * Adjust the `learning_rate` (or `lr`).
      * Modify the `batch_size`.

    **You must justify your changes with a brief inline comment** in the code, for example: `# Increased epochs for better convergence.`

    Additional information:
      Code idea:
      {idea}

      Device info:
      {device_info}

6.  Update the `if __name__ == "__main__":` block so that it loads data from `TRAIN_CSV` and `TEST_CSV` and correctly uses the output path variables if required by the main training function.

7.  Preserve all other logic, imports, function definitions, and comments **not related to file paths or the hyperparameters you have modified.**

8.  Return **only** the transformed Python code, with no additional explanation. The code should be ready to execute once the placeholders are filled.
"""

SPEED_UP_CODE_PROMPT_MT = """You are given the code for training an ML model.
Your task is to modify the code **temporarily for debugging purposes** so that it runs for a very short duration (e.g., 1 batch, 1 epoch, 1 iteration).

### CRITICAL RESTRICTION
Your modifications should primarily target the **training loop** (e.g., reduce `epochs`, `training steps`, or limit training batches).

While the **validation/evaluation loop** can also be shortened for speed, you **MUST NOT** disable or remove it entirely. A quick run should still test the end-to-end pipeline, which includes at least one step of evaluation. If you modify the validation part, limit it to a few batches but preserve the overall logic.

**You MUST NOT alter the prediction or inference logic for the test set.** The modifications should only affect the training and optionally the validation process for debugging speed. For example, you should not do something like `model.predict(X_test_processed.head(DEBUG_PREDICTION_SAMPLES))`. The entire test set should be used for final predictions if such a step is present.

**The core logic and final training outcome (if the code were to run to completion) should not be altered by these debugging modifications.** The changes should only limit the amount of computation performed.
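
One way to apply these restrictions is sketched below: a single switch caps the number of training and validation batches, while the test data is never passed through the limiter. `DEBUG`, `MAX_DEBUG_BATCHES`, and `maybe_limit` are hypothetical names, and the lists stand in for real data loaders:

```python
from itertools import islice

DEBUG = True            # hypothetical flag; set to False for the full run
MAX_DEBUG_BATCHES = 2   # hypothetical cap on batches per loop in debug mode

def maybe_limit(batches, debug=DEBUG, max_batches=MAX_DEBUG_BATCHES):
    # Yields only the first few batches in debug mode, all of them otherwise.
    return islice(batches, max_batches) if debug else iter(batches)

train_batches = list(range(100))
val_batches = list(range(20))
test_batches = list(range(10))

seen_train = list(maybe_limit(train_batches))  # shortened
seen_val = list(maybe_limit(val_batches))      # shortened, but still runs
seen_test = list(test_batches)                 # never limited
```

Because the limit lives behind one flag, restoring the full run only requires flipping `DEBUG` back to `False`.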

### CODE
{code}


### Answer format
You must return the modified code along with a description of the original code and the specific changes you made to shorten the run.

Python code format:
```python
# your modified code here
```


JSON format:
```json
{{
    "original_code_description": "A brief description of what the original code does (e.g., trains a gradient boosting model with early stopping, fine-tunes an image classifier for 20 epochs).",
    "changes_made": "A concise description of the modifications applied to shorten the run (e.g., 'reduced epochs from 20 to 1', 'limited training and validation to the first 2 batches')."
}}
```
"""

SPEED_UP_CODE_PROMPT_FE = """You are given the code for data processing and feature engineering.
Your task is to modify the code **temporarily for debugging purposes** to significantly reduce the size of the dataset being processed.
**The core data processing logic and the integrity of the data transformations (if the code were to run on the full dataset) should not be altered by these debugging modifications.** The changes should only limit the amount of data processed.

Pay close attention to data types and only use methods that are guaranteed to work for that specific data type (e.g., slicing for lists/arrays, head() for pandas DataFrames).
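
A type-aware truncation step might look like the sketch below. `truncate_for_debug` is a hypothetical helper, and the set of handled types is an assumption to adapt to the script at hand:

```python
def truncate_for_debug(data, n=1000):
    # Hypothetical helper: picks a truncation method that matches the type.
    if hasattr(data, "head"):                  # pandas DataFrame or Series
        return data.head(n)
    if isinstance(data, (list, tuple, str)):   # built-in sequences
        return data[:n]
    if isinstance(data, dict):                 # keep the first n key/value pairs
        return dict(list(data.items())[:n])
    if hasattr(data, "shape"):                 # array-like, sliced on the first axis
        return data[:n]
    raise TypeError("No safe truncation rule for " + type(data).__name__)
```

Raising on unknown types is deliberate here: it surfaces an unsupported object instead of silently applying a method that may not exist for it.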

### CODE

{code}


### Answer format
You must return the modified code along with a description of the original code and the specific changes you made to reduce the dataset size.

Python code format:
```python
# your modified code here
```

JSON format:
```json
{{
    "original_code_description": "A brief description of what the original code does (e.g., loads and preprocesses sensor data, performs feature engineering on sales transactions).",
    "changes_made": "A concise description of the modifications applied to reduce the dataset size (e.g., 'reduced DataFrame size by taking the first 1000 rows for debugging', 'sliced the input array to process only 1% of the original data')."
}}
```
"""

RETURN_CODE_SPEED_MT = """You are given previously **speed-optimized** code for training an ML model. You also have a description of what the original code did and the specific changes that were made to speed it up. Your task is to **restore the code to its original state for full training and evaluation**. You must revert any modifications that limited the training loop (e.g., number of epochs, training steps, batches) or the validation loop. You should only modify the parts related to limiting computation for debugging and leave all other core logic unchanged.

Pay close attention to common patterns for limiting computation, such as reduced `epochs`, `training_steps`, or data loader limits.


### Devices info
{devices_info}

### INPUT

Here's the speed-optimized code:

```python
{code}
```

This is what the original code did:
{original_code_description}

These changes were made to speed it up:
{changes_made}

You must return **only the Python code** with full training and evaluation restored. Do not include any JSON, descriptions, or additional text.

```python
# Your restored Python code goes here
```
"""

RETURN_CODE_SPEED_FE = """You are given previously **speed-optimized** code for data processing and feature engineering, which involved truncating the dataset. You also have a description of what the original code did and the specific changes that were made to speed it up. Your task is to **restore the code to process the full dataset**. You must revert any data truncation or sampling mechanisms, ensuring the code operates on the entire dataset as originally intended. You should only modify the data selection/truncation parts and leave all other data processing logic unchanged.

Pay close attention to common data truncation patterns, such as slicing (`[:some_number]`), `.head()`, or explicit sampling.

### Devices info
{devices_info}

### INPUT

Here's the speed-optimized code:

```python
{code}
```

This is what the original code did:
{original_code_description}

These changes were made to speed it up:
{changes_made}

You must return **only the Python code** with full dataset processing restored. Do not include any JSON, descriptions, or additional text.

```python
# Your restored Python code goes here
```
"""

FINAL_PREPROCESSING = """You are an expert agent specializing in refactoring and optimizing machine learning Python scripts. Your task is to take a Python script, parameterize its file paths, and **optimize its hyperparameters to potentially improve model performance.**

1.  First, place the following placeholders at the top of the script, before any function or class definitions:

    ```python
    TRAIN_CSV = r"{train_csv_path}"
    TEST_CSV = r"{test_csv_path}"
    OUTPUT_DIR = r"{output_dir}"
    SUBMISSION_FILENAME = "{submission_filename}"
    ```

2.  Identify any lines where pandas reads CSV files (e.g., `pd.read_csv(...)`). Replace the hard-coded paths or `os.path.join(...)` expressions with the `TRAIN_CSV` and `TEST_CSV` placeholders.

3.  Identify the location where the submission file is saved (e.g., `submission.to_csv(...)`). Replace the logic for saving the file with the following block:

    ```python
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    submission_path = os.path.join(OUTPUT_DIR, SUBMISSION_FILENAME)
    submission.to_csv(submission_path, index=False)
    ```

4.  Remove or reassign any constants that hard-code directory paths (e.g., `DATA_DIR`, `SOURCE_DIR`). If they are still required for building other paths, use a generic placeholder, such as `BASE_DIR = "{base_dir}"`.

5.  **Analyze and Tune Hyperparameters (Optional but Encouraged).** Examine the script for key hyperparameters. If you identify an opportunity to improve model performance, you are encouraged to **make changes.** For example:

      * Increase the number of `epochs`.
      * Adjust the `learning_rate` (or `lr`).
      * Modify the `batch_size`.

    **You must justify your changes with a brief inline comment** in the code, for example: `# Increased epochs for better convergence.`

6.  Update the `if __name__ == "__main__":` block so that it loads data from `TRAIN_CSV` and `TEST_CSV` and correctly uses the output path variables if required by the main training function.

7.  Preserve all other logic, imports, function definitions, and comments **not related to file paths or the hyperparameters you have modified.**

8.  Return **only** the transformed Python code, with no additional explanation. The code should be ready to execute once the placeholders are filled.
"""