# 📜 Config Structure

This document describes the parameters available in the `config.yaml` file.

## General Description

### General Settings

```yaml
model_name: 'gemini-2.5-flash'    # The name of the API model you want to use for the run.
model_type: "gemini"             # The model type, required for correct server processing. It makes switching the LLM backend easy in the future (e.g., to gpt-4o).
competition: "titanic"           # The competition to run.
log_dir: logs                    # The name of the directory for saving logs.
```

### General Parameters

```yaml
need_format_pipeline: false      # Code formatting after tree initialization, does not affect logic, improves readability of intermediate results.
time_run_minutes: 100            # Total pipeline runtime in minutes.
timeout: 60                      # Time in seconds for an LLM to respond.
runtime_error_time: 10           # Time in seconds for code to run (if exceeded, a RuntimeError is triggered).
top_N_for_running_on_test: 0     # Number of nodes for intermediate validation on the real test set, if it is known.
need_test_score: false           # Similar to top_N_for_running_on_test; whether intermediate validation on the real test set is needed.
subset_size_in_percent: 10       # Percentage of the dataset to use as a subset if the data is large.
validator_size_threshold: "10^5" # The number of rows in train.csv that determines if the data is large or not.
CUDA_VISIBLE_DEVICES: "6"        # Index (or comma-separated indices) of the GPU devices visible to the run.
need_pro_validator: false        # Improved validation with preliminary data analysis and reasoning.
dynamic_model: true              # Use different models for different tasks.
```
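
How `validator_size_threshold` and `subset_size_in_percent` might work together can be sketched as follows. This is a minimal illustration, not the project's actual code. The helper names `parse_threshold` and `subset_row_count` are hypothetical. The sketch assumes that the string `"10^5"` means 10 to the power of 5 and that a dataset counts as large when its row count exceeds the threshold.

```python
def parse_threshold(value: str) -> int:
    """Turn a string such as "10^5" into an integer (assumed meaning: 10**5)."""
    if "^" in value:
        base, exponent = value.split("^", 1)
        return int(base) ** int(exponent)
    return int(value)


def subset_row_count(n_rows: int, threshold: int, percent: int) -> int:
    """Return how many rows to keep: a percentage subset if the data is large."""
    if n_rows > threshold:
        return n_rows * percent // 100
    return n_rows


threshold = parse_threshold("10^5")
print(subset_row_count(250_000, threshold, 10))  # large dataset: 25000
print(subset_row_count(891, threshold, 10))      # small dataset: 891
```

Under these assumptions, a `train.csv` with 250,000 rows would be reduced to a 10% subset, while a small dataset would be used in full.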

### Number of Attempts

**The number of attempts allocated for each agent.**

```yaml
number_of_attempts_reader: 1      # Attempts to generate contextual information about the competition.
number_of_attempts_validator: 3   # Attempts to generate code to split data into validation and training sets.
number_of_attempts_scorer: 3      # Attempts to generate code to implement a metric for solution evaluation.
number_of_attempts_baseline: 5    # Attempts to generate baseline code.
number_of_attempts_insight: 10    # Attempts to generate ideas for each stage.
number_of_attempts_coder: 5       # Attempts to implement an idea.
number_of_attempts_checker: 3     # Number of attempts for the checker agent to generate output.
number_of_attempts_install: 5     # Number of attempts to install libraries.
number_of_attempts_debug: 3       # Attempts to debug the code.
number_of_attempts_debug_submit: 3 # Attempts to debug the submission.
```
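
One plausible reading of these limits is a simple retry loop: an agent is called again until it succeeds or its attempt budget runs out. The sketch below illustrates that idea only. The function `run_with_attempts` and the `flaky_validator` stand-in are hypothetical and are not taken from the project.

```python
from typing import Callable, Optional


def run_with_attempts(step: Callable[[], str], attempts: int) -> Optional[str]:
    """Call `step` up to `attempts` times; return the first successful result."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as error:  # a real agent would likely catch narrower errors
            print(f"attempt {attempt}/{attempts} failed: {error}")
    return None  # every attempt failed


# Hypothetical agent that fails twice and then succeeds.
calls = {"count": 0}


def flaky_validator() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("generated code did not run")
    return "split code"


config = {"number_of_attempts_validator": 3}
result = run_with_attempts(flaky_validator, config["number_of_attempts_validator"])
print(result)  # split code
```

With a budget of 3, the third call succeeds and its result is returned. With a budget of 2, the same agent would return `None`.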

### Insight

**The number of ideas generated during a standard run.**

```yaml
number_of_ideas_eda: 2            # Number of ideas for exploratory data analysis.
number_of_ideas_data: 2           # Number of ideas for the data preprocessing stage.
number_of_ideas_modelling: 1      # Number of ideas for the model training stage.
insight_delay: 5                  # Time delay in seconds after processing one idea.
```

### Adding

**Available algorithms for adding new ideas:**

```yaml
# Available algorithms:
# algorithm_type_adding: base_random_adding
# algorithm_type_adding: "adding_with_groups_split"
# algorithm_type_adding: "top_n_adding"
# algorithm_type_adding: adding_probability_distribution
algorithm_type_adding: "adding_greedy_epsilon" # The algorithm used for adding ideas.
adding_epsilon: 0.3                   # Epsilon value for the adding_greedy_epsilon algorithm.
number_of_selected_node: 1            # Number of nodes to which new ideas will be added.
max_add_idea: 1                       # Maximum number of new ideas.
top_n_for_eda: 4                      # Top-n best and top-n worst ideas used as context when adding EDA.
```
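
For readers unfamiliar with epsilon-greedy selection, the role of `adding_epsilon` can be sketched as below. This is the textbook technique and not necessarily how `adding_greedy_epsilon` is implemented in this project. The function `pick_node`, the node names, and the scores are made up for illustration, and the sketch assumes that a higher score is better.

```python
import random
from typing import Dict, Optional


def pick_node(scores: Dict[str, float], epsilon: float,
              rng: Optional[random.Random] = None) -> str:
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng is None:
        rng = random.Random()
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))  # explore: any node, uniformly at random
    return max(scores, key=scores.get)     # exploit: the best-scoring node


scores = {"node_a": 0.78, "node_b": 0.81, "node_c": 0.74}  # made-up scores
print(pick_node(scores, epsilon=0.0))  # always the best node: node_b
print(pick_node(scores, epsilon=0.3))  # usually node_b, sometimes a random node
```

In this reading, a larger epsilon spends more of the run on exploring weaker nodes, while a value of 0 always extends the current best node.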

### Merging

```yaml
algorithm_type_merging: "greedy_algorithm"
number_of_iterations_parents: 2       
number_of_selected_node_merging: 2    
number_of_iterations_children: 3      
merging_epsilon: 0.3                  
max_memory_long: 3                    # The threshold at which a combination of ideas is moved to the blacklist
max_iteration_without_update_best_score: 3 # Only for greedy_algorithm
```

### Debugger

```yaml
number_of_attempts_debug_generate: 3  
max_count_install_error: 1            
max_count_of_identical_errors: 3      
number_of_iter_for_code_regeneration: 3 
debug_mode: "holistic"                # Available algorithms: 'holistic', 'three-stage phase'
debug_speed_mode: 'fast'              # 'standart' for slower but more accurate debugging, 'fast' for faster debugging.
```

### Scoring Model

```yaml
need_scoring_predict: null            # Is a scoring model needed?
anchor_examples: true                 # Whether to look at other examples when predicting the speed.
max_minutes_to_run_for_complex_training: 5 # Time in minutes for complex training ideas to run.
number_of_ideas_min: 2                # Minimum number of ideas for initial training.
number_of_ideas_max: 3                # Maximum number of ideas for initial training.
is_scoring_model_test: null           # Is it necessary to execute the code after the prediction, to check the correctness of the predictions?
```

### RAG Agent

```yaml
use_rag: true                         # Is RAG needed?
retrieve_n_papers: 3                  # How many papers to retrieve for the initial pool of ideas.
retrieve_n_competitions: 3            # How many competitions to retrieve for the initial pool of ideas.
number_rag_ideas: 5                   # How many RAG ideas the LLM will use.
competitions_path: "kaggle_database/competitions_ideas.json"   # Path to the stored competition data in JSON format.
vector_index_path: "kaggle_database/ideas_vectorbase.faiss"     # Path to the FAISS index file containing competition embeddings.
metadata_path: "kaggle_database/ideas_metadata.json"          # Path to a JSON file mapping FAISS index to (link, description).
```

### Memory

```yaml
memory_size: 5                        # Memory size
memory_algorithm: "random_nodes"      # Available algorithms: "random_nodes", "distant_nodes", "nearest_nodes" or None.
```