Keywords: Multi-Agent, Eval, LLM, Auction, Bluffing
TL;DR: We propose an LLM benchmark for strategic resource management and bluffing, inspired by the game Kuhhandel/Bluff it.
Abstract: Standard benchmarks evaluate LLM knowledge and single-agent reasoning, but
miss the capabilities required for real-world strategic interaction: bluffing,
negotiation, and long-term resource management. Existing game benchmarks
isolate individual skills, such as deception in Werewolf or bidding in simple
auctions, rather than requiring their integrated deployment. We introduce CATTLE
TRADE, a benchmark based on the card game Kuhhandel that integrates
competitive auctions, hidden-information trades, and deceptive offers within games
of 50–60 turns. We evaluate six frontier LLMs across 33 games and find that strategic
commitment, measured through offer values in trades and buy-right exercise rates,
strongly predicts success, while pure bluffing strategies underperform.
Submission Number: 75