VALIDATOR_INFO_PROMPT = """
## DATA
You have a single dataset located at: {path_to_data} 

Dataset description:  
{data_description}

Sample submission head (first few lines of expected output, if applicable):  
{sample_submission_head}

## Instructions
Your task is to write Python code that flexibly inspects and summarizes this dataset—regardless of its format (e.g., tabular CSV with features and labels, image file paths and annotations, 3D coordinate files, etc.)—in order to inform a later train/test split. The code may explore any aspects you find useful, and should include, but is not limited to, the following:

- **Loading the data** in the appropriate way (e.g., `pandas.read_csv`, custom file loaders for images or 3D data, etc.)
- **Overview of dataset size**: number of records, files, or samples
- **Structure inspection**: list of columns or fields, file path patterns, directory structure
- **Type inference**: identify numeric, categorical, datetime, filepath, or other custom types
- **Missing data analysis**: counts and proportions of missing or empty entries
- **Basic statistics** for numeric fields (mean, std, min/max, quantiles) when relevant
- **Category frequencies** for categorical fields (unique counts, top values)
- **Preview samples**: show a small sample of rows or display a few images/annotations
- **Label/target distribution** if a target or label field is present
- **Potential grouping or stratification keys**: fields with few unique values
- **High-cardinality features**: fields with very many unique values
- **Any additional exploration** you deem helpful for planning a robust split

At the end, aggregate all findings into a Python dictionary named `data_summary` and print it.

**Important**:  
- Do not perform any data splitting or model training here—only exploratory analysis.  
- The code should adapt to the data format it encounters.  

Please provide **only** the Python code wrapped in a triple-backtick block, for example:

```python
# your exploratory code here...
```
"""

VALIDATOR_ANALYZER_PROMPT = """
You have received the exploratory analysis output from the previous agent in the variable `info_about_data`.


## info_about_data:
{info_about_data}


## Your task is to:

1. **Read and interpret** the contents of `info_about_data`, which is a Python dictionary containing all the summary statistics, type inferences, distribution analyses, and any other findings.
2. **Think through** these findings: identify potential pitfalls (e.g., class imbalance, high-cardinality features, grouping structures) and opportunities for stratification or grouping.
3. **Formulate concise, actionable recommendations** for how to split the data into training and test sets:
   - Which fields to stratify on
   - Which groups to keep together
   - How to handle rare categories or imbalanced classes
   - What split proportions and scheme to use (e.g., an 80/20 split or group-wise splits)

**Do NOT** recommend any data cleaning, imputation, removal of missing values, or other preprocessing steps—this agent’s focus is **exclusively** on data partitioning.

You must provide both your internal reasoning and your final advice. The output **must be** a single JSON object with two keys:

```json
{{
  "thoughts": "Your detailed internal reasoning about what you saw in the data summary and how you reached your conclusions.",
  "recommendations": "A concise set of recommendations for data partitioning based on those observations."
}}
```
"""


VALIDATOR_SPLIT_PROMPT = """
You are an expert AI agent that writes Python code to prepare a dataset for model validation. Your primary task is to split a training dataset into a new training set and a validation set. A critical part of your task is to correctly handle all file operations without altering the original data format in the metadata files.

The code you generate must create three specific files in the output directory: `train.csv`, `test.csv`, and `test_submit.csv`.

## Input Parameters
- The original dataset is located at: `path_to_data = {path_to_data}`
- The output files should be saved in: `save_path = {save_path}`
- A sample submission file for format reference is at: `path_to_sample_submission = {path_to_sample_submission}`
- The maximum number of rows to use before subsampling is: `size_threshold = {size_threshold}`
- If the dataset has more rows than this threshold, the percentage of rows to keep is: `subset_size_in_percent = {subset_size_in_percent}`

## Primary Instructions & Source of Truth
1.  **Data Description**: Use the description below to understand the structure of the data (e.g., tabular, image-based, directory layout). This is your primary guide.
    ```
    {data_description}
    ```
2.  **Analyzer Recommendations**: Use these recommendations to guide your splitting process (e.g., for stratification).
    ```
    {recommendations}
    ```

## Your Task: Generate Python Code

Generate a single, robust Python code block that performs the following steps.

### Step 1: Load and Trim the Dataset
1.  Load the primary metadata file (e.g., the original `train.csv`).
2.  If the number of rows exceeds `size_threshold`, reduce the dataset to a random sample containing `subset_size_in_percent` percent of the rows.
3.  All subsequent steps must be performed on this potentially smaller dataset.

### Step 2: Split the Data
1.  Split the dataframe from Step 1 into two non-overlapping sets. These will become your new **`train.csv` (training set)** and **`test.csv` (validation set)**.
2.  Use the `Analyzer Recommendations` to apply techniques like stratification.

### Step 3: Manage Files and Preserve IDs
This step is critical for datasets with external files like images.
1.  **Replicate Structure**: Inside `save_path`, create a directory structure that is an **exact replica** of the file-containing directories from the original data (e.g., `train/`, `test/`).
2.  **Copy Files**: Copy the actual files (e.g., `.jpg`) for your new training split and validation split into the appropriate newly created directories (e.g., files for the new training set go into `save_path/train/`).
3.  **Preserve IDs in CSV**: The ID/filename column in your new `train.csv` and `test.csv` dataframes **must remain completely unchanged**. Do not add directory prefixes like `train/` or `test/` to the filenames in the CSV file itself. If an ID in the original data was `image.jpg`, it must remain `image.jpg` in your output CSVs.

### Step 4: Prepare the `test_submit.csv` File
1.  Use your newly created **`test.csv`** (the validation set) to generate the submission file.
2.  **Column Structure**: The submission file must use the **exact same column names and order** as the file at `path_to_sample_submission`. Your code must read the headers from that file.
3.  **ID Column Values**: The ID column in `test_submit.csv` must contain the **exact same values** as the corresponding ID column in the `test.csv` you just created (which, according to Step 3, are the original, unmodified filenames).
4.  **Target Values**: The prediction columns must be filled with the **correct target labels** taken directly from your `test.csv`.

### Step 5: Save All Results
1.  Save the final dataframes and the submission file to the `save_path` directory.
2.  The following three files **must** be created:
    -   `train.csv`: The metadata for your new training set, with original IDs.
    -   `test.csv`: The metadata for your new validation set, with original IDs.
    -   `test_submit.csv`: The submission file for the validation set.

Your final output must be only the Python code, wrapped in a triple-backtick block.

```python
# Your robust and generalized Python code here
```

"""

WITHOUT_RECOMMENDATION_TEXT = "The agent decided not to provide any additional information about the data"
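

# --- Illustrative usage (a sketch, not part of the original module) ---------
# The helpers below show one way the templates above might be filled in and
# how the analyzer's reply might be parsed. The names `render_prompt` and
# `extract_recommendations` are hypothetical. Two things are assumed rather
# than confirmed by this file: that the templates are rendered with
# `str.format` (suggested by the doubled braces in VALIDATOR_ANALYZER_PROMPT),
# and that a missing or unparsable analyzer reply should fall back to
# WITHOUT_RECOMMENDATION_TEXT.
import json
import re


def render_prompt(template: str, **fields: object) -> str:
    """Fill a prompt template via str.format; raises KeyError if a field is missing."""
    return template.format(**fields)


def extract_recommendations(reply: str, fallback: str) -> str:
    """Return the "recommendations" value from the analyzer's JSON reply.

    Accepts a bare JSON object or one wrapped in a Markdown code fence.
    Returns `fallback` when no object is found, the JSON is invalid, or the
    "recommendations" value is missing, empty, or not a string.
    """
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return fallback
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    if not isinstance(payload, dict):
        return fallback
    recommendations = payload.get("recommendations")
    if isinstance(recommendations, str) and recommendations.strip():
        return recommendations
    return fallback


# Possible wiring (assumed, with `llm_reply` standing in for the analyzer output):
#   recs = extract_recommendations(llm_reply, WITHOUT_RECOMMENDATION_TEXT)
#   prompt = render_prompt(VALIDATOR_ANALYZER_PROMPT, info_about_data=summary_text)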