# Configuration for vocab scaling law discovery with OpenEvolve
max_iterations: 50
checkpoint_interval: 1
log_level: "INFO"
random_seed: 42
wandb:
  enabled: true
  project: "openevolve"
  name: "vocab_scaling_law-{model}"
  group: "single_task/vocab_scaling_law/{model}"
  job_type: "single_task"
  tags: ["sldbench", "single_task", "vocab_scaling_law", "{model}"]
  mode: "online"

# LLM configuration
llm:
  primary_model: "gemini-3-flash-preview"
  primary_model_weight: 1.0
  secondary_model: null
  secondary_model_weight: 0.0
  api_base: "https://generativelanguage.googleapis.com/v1beta/openai/"
  max_tokens: 16384
  timeout: 240
  retries: 10
  retry_delay: 10

# Prompt configuration
prompt:
  system_message: |
    You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently naive BFGS) so that they better model how Lossu (the unigram-normalized language model loss) depends on vocabulary size, non-vocabulary parameter count, and number of characters.

    **IMPORTANT: The scaling law function must use no more than 7 parameters.**

    Focus on mathematical accuracy across different vocabulary configurations, cross-dataset generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.

    **DATA CHARACTERISTICS**
    - Features: [non_vocab_parameters, vocab_size, num_characters] - 3D input
    - Labels: unigram_normalized_loss - scalar output
    - Dataset size: 1080
    - Vocabulary size: 4096 to 96256 tokens (8 distinct sizes)
    - Embedding dimension: 512 to 2048 (4 distinct values)
    - Character count: 1e8 to 5e12 characters (100M to 5T characters)
    - Non-vocab parameters: 3.3e7 to 1.1e9 (33M to 1.1B parameters)
    - FLOPs range: 1.3e16 to 4.4e20 operations
    - Lossu range: -5.34 to -0.51
    - Lossu measures improvement over a context-free unigram model (negative = better than unigram)
    - The dataset explores vocabulary scaling trade-offs across parameter, data, and architecture dimensions

    The function signatures must remain:

    ```python
    def scaling_law_func(data_points, params):
        # data_points: (N,3) array with columns [P_non_vocab, vocab_size, num_characters]
        #   P_non_vocab: non-vocabulary parameter counts
        #   vocab_size: vocabulary sizes
        #   num_characters: number of characters processed
        # params: array of up to 7 parameters
        # Returns: predicted Lossu values
        ...

    def fit_scaling_law(data_points, loss_values):
        # data_points: (N,3) array with columns [P_non_vocab, vocab_size, num_characters]
        #   P_non_vocab: non-vocabulary parameter counts
        #   vocab_size: vocabulary sizes
        #   num_characters: number of characters processed
        # loss_values: array of corresponding Lossu values
        # Returns: optimized parameters (up to 7 parameters)
        ...
    ```

    Write all improvements between the `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` markers.

    Do not use input-dependent features in `scaling_law_func`, i.e., statistics computed from the input data such as its median, min, or max.

  num_top_programs: 3
  num_diverse_programs: 2
  use_template_stochasticity: true

# Database configuration for evolution
database:
  population_size: 100
  archive_size: 50
  num_islands: 5
  migration_interval: 25
  migration_rate: 0.1
  elite_selection_ratio: 0.1
  exploration_ratio: 0.2
  exploitation_ratio: 0.7
  feature_dimensions: ["combined_score", "complexity", "diversity"]
  feature_bins: 10

# Evaluator configuration
evaluator:
  timeout: 600
  max_retries: 3
  cascade_evaluation: false
  cascade_thresholds: [0.3, 0.6]
  parallel_evaluations: 4
  use_llm_feedback: false

# Evolution settings
diff_based_evolution: false
max_code_length: 100000
