Keywords: Large Language Models, Benchmark, Code Generation, Poker Bots
Abstract: We introduce Husky Hold'em Bench, a novel agent benchmark that combines strategic reasoning with software engineering skill. Agents are tasked with implementing poker bots that then compete in a 6-player round-robin tournament. We use a minimal 5-stage iterative refinement agent scaffold to solicit bots from current frontier models and run the tournament, averaging over several trials to reduce variance. We find that Claude 4 Sonnet tops the leaderboard and that, in general, top-ranking models employ balanced or aggressive play styles, while lower-ranking models play more defensively. We open-source our code as well as all data from the tournament.
Submission Number: 228