# Targeted (gpt4-o) few shots program synthesis with bandits
<img src="results/analysis/prog-synthesis/average_top_20_rewards.jpg" />
^ Results. Average score of top 20 generated programs (from first to current iteration).

## Experiment details:
- Conditions
    - [x] bandits_random.mixed
    - [x] bandits_self_score.mixed
    - [x] bandit_offspring_score.mixed
    - [x] bandits_self_score.lle
- Conditions vary by:
    - Primitives used (mixed, lle)
    - Few shots-example selection method
        - bandits_random (random selection of examples)
        - bandits_self_score 
            - Sample w.r.t p ~ Beta(w*score, w*(1-score))
            - w = 2 *iteration
        - bandit_offspring_score
            - Sample w.r.t. p ~ Beta(success_count, failure_count)
            - At each iteration, update each example such that
                - success_count += mean(score(offsprings))
                - failure_count += 1 - mean(score(offsprings))
    - See implementation details in [synthesizer.py](../../h4rm3l/src/h4rm3l/synthesizer.py)
- Common parameters for all conditions:
    - 15 selected examples per iteration
    - 100 iterations (but system stoped working around 40. Exceeded API thoughput quota)
    - 20 program proposals per iteration
    - 30 proposals per batch
    - 5 eval prompts (randomly sampled during validation)
    - Dynamic example pool update
        - Synthesized programs that score above the example pool average are added to the pool to serve as potential examples for subsequent iterations
    - Used LLMs:
        - Target model: gpt4-o
        - Synthesis model: gpt4
            - also used for decoration during synthesis
        - Moderation model: gpt4

## Results
- Example selection based on program score works best
- Example selection based on offspring score did not ramp up fast (perhpas beta params should be scaled with iterations like in the self score case)
- lle representation did not work as well as mixed (hle + lle)

## Artifacts
- Initial examples:
    - [config/program_examples_lle.csv](config/program_examples_lle.csv)
    - [config/program_examples_mixed.csv](config/program_examples_mixed.csv)
- Program Synthesis curves:
    - [Program synthesis curves](results/analysis/prog-synthesis/)
- Synthesized programs:
    - [data/synthesized_programs/](data/synthesized_programs/)
- Selected synthesized pogs for validation
    - [exp_116_selected_progs_claude.csv](exp_116_selected_progs_claude.csv)
- 20 Harmful prompts for validation: 
    - [data/valid_harmful_prompts.csv](data/valid_harmful_prompts.csv)
- Decorated valid prompts:  
    - [data/decorated_valid_harmful_prompts.csv](data/decorated_valid_harmful_prompts.csv)
- Benchmark results:
    - [results/analysis/](results/analysis/) 


valid_eval_claude-3-haiku-20240307.csv

# Reproducing this experiment
- Sample train harmful prompts:
```
make data/train_harmful_prompts.csv
```
- Synthesize programs
```
make data/synthesized_programs/syn_progs.bandit_self_score.mixed.csv 
make data/synthesized_programs/syn_progs.bandit_offspring_score.mixed.csv
make data/synthesized_programs/syn_progs.bandit_random.mixed.csv
make data/synthesized_programs/syn_progs.bandit_self_score.lle.csv
```
- Generate harmful prompts
```
make data/decorated_valid_harmful_prompts.csv
```
- Evaluate
```
make results/valid_eval_claude-3-haiku-20240307.csv
```

