Husky Hold'em Benchmark: Can LLMs Design Competitive Poker Bots?

Published: 24 Sept 2025, Last Modified: 28 Nov 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: Large Language Models, Benchmark, Code Generation, Poker Bots
Abstract: We introduce Husky Hold'em Bench, a novel agent benchmark that combines strategic reasoning with software engineering skills. Agents are tasked with implementing poker bots, which then compete in a 6-player round-robin tournament. We use a minimal 5-stage iterative refinement agent scaffold to solicit bots from current frontier models and run a poker bot tournament, averaging over several trials to reduce variance. We find that Claude 4 Sonnet tops the leaderboard and that, in general, top models employed balanced or aggressive play styles, while lower-ranking models tended to play more defensively. We open-source our code as well as all data from the tournament.
Submission Number: 228