# Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning


## Environment Setup

```bash
# Create and activate conda environment
conda env create -f environment.yaml
conda activate es
```

## Usage

### 1. Model Training

Multi-GPU distributed training:

```bash
accelerate launch \
    --num_processes <NUM_GPUS> \
    --num_machines <NUM_MACHINES> \
    --machine_rank <MACHINE_RANK> \
    --mixed_precision bf16 \
    es_llm_countdown.py \
    --data_sample <SAMPLE_SIZE> \
    --model_name Qwen/Qwen2.5-3B-Instruct \
    --reward_type grpo_reward \
    --hf_cache_dir /path/to/your/hf_cache \
    --gpu_threads <GPU_THREADS> \
    --mixed_precision bf16
```

Parameters:
- `<NUM_GPUS>`: Number of GPU processes to use
- `<NUM_MACHINES>`: Number of machines in the distributed setup (1 for a single machine)
- `<MACHINE_RANK>`: Rank of the current machine (0 for a single machine)
- `<SAMPLE_SIZE>`: Number of training data samples
- `<GPU_THREADS>`: Number of GPU threads per process
- `--model_name`: Base model name
- `--reward_type`: Reward function type (`grpo_reward` or `toy_reward`)
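- `--hf_cache_dir`: Hugging Face cache directory
- `--mixed_precision`: Precision mode (`bf16` above); it is passed once to `accelerate launch` and once to the training script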
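
As a concrete illustration, a single-machine run could be assembled as in the sketch below. The values (4 GPUs, 200 samples, 1 GPU thread) are placeholders chosen for this example, not recommended settings, and the cache path is the same placeholder used above. The command is built as a bash array and only printed, so nothing is launched until the last line is uncommented.

```bash
#!/usr/bin/env bash
# Illustrative values only (assumptions, not recommended settings).
NUM_GPUS=4        # one process per GPU
NUM_MACHINES=1    # single machine
MACHINE_RANK=0    # rank 0 on a single machine
SAMPLE_SIZE=200   # hypothetical number of training samples
GPU_THREADS=1     # hypothetical number of GPU threads per process
HF_CACHE_DIR="/path/to/your/hf_cache"   # replace with a real directory

# Build the launch command as an array so every argument stays one word.
cmd=(
    accelerate launch
    --num_processes "$NUM_GPUS"
    --num_machines "$NUM_MACHINES"
    --machine_rank "$MACHINE_RANK"
    --mixed_precision bf16
    es_llm_countdown.py
    --data_sample "$SAMPLE_SIZE"
    --model_name Qwen/Qwen2.5-3B-Instruct
    --reward_type grpo_reward
    --hf_cache_dir "$HF_CACHE_DIR"
    --gpu_threads "$GPU_THREADS"
    --mixed_precision bf16
)

# Dry run: print the full command for inspection.
printf '%s ' "${cmd[@]}"; echo

# Uncomment to actually launch training.
# "${cmd[@]}"
```

Keeping the settings in variables at the top makes it easy to reuse the same script across machines by changing only `MACHINE_RANK`.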

### 2. Model Inference

Evaluate trained models:

```bash
python inference_countdown.py \
    --model_path /path/to/saved/model \
    --base_model_name "Qwen/Qwen2.5-3B-Instruct" \
    --reward_type "grpo_reward" \
    --train_samples <TRAIN_SAMPLES> \
    --eval_samples <EVAL_SAMPLES> \
    --eval_offset <EVAL_OFFSET> \
    --mixed_precision "fp16" \
    --max_new_tokens <MAX_TOKENS> \
    --batch_size <BATCH_SIZE> \
    --verbose \
    --save_responses \
    --generate_plots \
    --show_examples <NUM_EXAMPLES>
```

Parameters:
- `--model_path`: Path to trained model
- `<TRAIN_SAMPLES>`: Number of samples the model was trained on
- `<EVAL_SAMPLES>`: Number of evaluation samples
- `<EVAL_OFFSET>`: Offset for evaluation data
- `<MAX_TOKENS>`: Maximum number of tokens to generate
- `<BATCH_SIZE>`: Batch size for inference
- `<NUM_EXAMPLES>`: Number of examples to display
- `--generate_plots`: Generate performance plots
- `--save_responses`: Save model responses to file
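
The sketch below shows one way the sample-count arguments might be chosen so that evaluation uses data the model did not see in training. It assumes that `--eval_offset` is an index into the same dataset the training samples were drawn from, which this README does not confirm; check `inference_countdown.py` before relying on it. All numeric values are hypothetical, and the command is only printed, not executed.

```bash
#!/usr/bin/env bash
# Hypothetical values for illustration only.
TRAIN_SAMPLES=200             # must match the value used for training
EVAL_SAMPLES=100              # number of examples to evaluate
EVAL_OFFSET=$TRAIN_SAMPLES    # ASSUMPTION: start right after the training samples
MAX_TOKENS=1024               # hypothetical generation limit
BATCH_SIZE=8                  # hypothetical inference batch size

eval_cmd=(
    python inference_countdown.py
    --model_path /path/to/saved/model
    --base_model_name "Qwen/Qwen2.5-3B-Instruct"
    --reward_type "grpo_reward"
    --train_samples "$TRAIN_SAMPLES"
    --eval_samples "$EVAL_SAMPLES"
    --eval_offset "$EVAL_OFFSET"
    --mixed_precision "fp16"
    --max_new_tokens "$MAX_TOKENS"
    --batch_size "$BATCH_SIZE"
)

# Under the assumption above, this is the half-open index range evaluated.
EVAL_END=$((EVAL_OFFSET + EVAL_SAMPLES))
echo "Evaluating samples [$EVAL_OFFSET, $EVAL_END)"

# Dry run: print the command; uncomment the last line to execute it.
printf '%s ' "${eval_cmd[@]}"; echo
# "${eval_cmd[@]}"
```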


For the RL experiments, we run the Countdown task based on TinyZero (PPO) and GRPO-Zero (GRPO).
