# Experiment 130: Benchmarking
## Benchmarking Results


<img src="results/analysis/benchmark_plot.png" />


## Comparison of synthesis methods
<img src="results/prog-synthesis-conditions/prog-synthesis-conditions_average_top_20_rewards.jpg" />

- experiment_117_gpt4o (compares all program synthesis conditions on a single model)
	- bandit_offspring_score.mixed (100 iterations)
	- bandit_random.mixed.csv (100 iterations)
	- bandit_self_score.mixed.csv (100 iterations)
	- bandit_self_score.lle.csv (83 iterations)

## Comparison of target models
<img src="results/prog-synthesis-targets/prog-synthesis-targets_average_top_20_rewards.jpg" />

- bandit_self_score.mixed (compares multiple models under our best program synthesis condition)
	- experiment_117_gpt4o (100 iterations)
	- experiment_118_claude_sonnet (97 iterations)
	- experiment_119_claude_haiku (100 iterations)
	- experiment_120_gpt3.5 (100 iterations)
	- experiment_121_llama3-8b (93 iterations)
	- experiment_122_llama3-70b (75 iterations)

# Benchmark programs
- [data/benchmark/h4rm3l_benchmark_20240604.csv](data/benchmark/h4rm3l_benchmark_20240604.csv)
- Contains the top 10 programs from each of the 9 synthesis experiments, plus 23 SOTA programs (9 × 10 + 23 = 113 programs total)
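
The program count above can be sanity-checked against the CSV. The snippet below is a minimal sketch, not part of the experiment code: it assumes the file has a single header row and one program per CSV record, and the helper name `count_programs` is hypothetical.

```python
import csv

# Counts stated in this README.
N_EXPERIMENTS = 9   # synthesis experiments
TOP_K = 10          # top programs kept per experiment
N_SOTA = 23         # SOTA programs
EXPECTED_TOTAL = N_EXPERIMENTS * TOP_K + N_SOTA  # 9 * 10 + 23 = 113


def count_programs(csv_path):
    """Count data records in the CSV.

    Assumption: one header row, one program per record. csv.reader is used
    instead of counting lines so that quoted multi-line fields count once.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the (assumed) header row
        return sum(1 for row in reader if row)


if __name__ == "__main__":
    n = count_programs("data/benchmark/h4rm3l_benchmark_20240604.csv")
    print(f"{n} programs found; {EXPECTED_TOTAL} expected")
```

If the two numbers differ, the header or one-record-per-program assumption may not hold for this file.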


