Abstract: Large Language Models (LLMs) show significant potential in multi-agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, aiming to provide a highly complex and realistic evaluation framework for LLMs. This work investigates the reproducibility of several claims made for that benchmark and extends the original analysis to better characterize fairness, interpretability, and generalizability within it. We replicate the original experiments on a broad set of additional models and introduce new metrics that capture actual negotiation quality and fairness. Our findings show that while the benchmark is indeed complex, model comparisons remain ambiguous, raising questions about its objectivity. We further identify limitations in the experimental setup, particularly in information-leakage detection and the transferability of ablations, which affect the robustness of the results. By analyzing the behavior of a wider range of models on an extended version of the benchmark, we surface key insights that give potential users much-needed context. Our results highlight the importance of context in model-specific evaluations and the need for more nuanced metrics for assessing negotiation performance.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: Quanquan Gu
Submission Number: 4295