Beyond Static Truthfulness Benchmarks: Two Truths and One Lie for Multi-Agent Deception and Detection
Keywords: trustworthy AI, truthfulness, hallucination, deception detection, multi-agent systems, automated evaluation, benchmarking, LLM alignment
TL;DR: A tournament-style alternative to static truthfulness benchmarks that measures factual robustness and cross-model detection through multi-agent 2T1L play
Abstract: Existing truthfulness benchmarks typically evaluate a single model in isolation against static datasets or human-written reference answers. As LLMs improve, these fixed test sets and reference-based evaluations risk saturating and provide limited insight into how models behave when they interact, critique, or compete. We argue for truthfulness benchmarks that can scale with increasingly capable agents by defining performance relationally, through repeated multi-agent play. We propose a tournament-style evaluation framework based on the party game Two Truths and One Lie (2T1L), in which each LLM alternates between two roles: a generator that produces two true statements and one lie, and a detector that must identify the lie in other models’ triples. From these multi-agent interactions we build an adjacency matrix over models and derive two interpretable metrics: a LieScore measuring how difficult a model’s lies are to detect, and a DetectScore measuring how reliably a model identifies others’ lies. This relational, role-swapped evaluation induces a tournament ranking over models and naturally extends to settings with human players or tool-augmented LLMs, enabling the design of stronger multi-agent systems for improved human alignment and truthfulness. We present preliminary experiments with several open models, visualizing the interaction heatmap and induced rankings, and outline how rank correlation across topics can reveal domain-specific strengths and weaknesses. Our results suggest that game-based, multi-agent benchmarks can mitigate limitations of both static exams and preference-based leaderboards, and are well suited for assessing LLMs in agentic settings.
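The abstract does not give closed-form definitions of the two metrics, but a minimal sketch consistent with its description might look like the following. The matrix convention (rows as generators, columns as detectors, entries as detection rates), the simple averaging, and the function names `lie_score` and `detect_score` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Assumed convention (not taken from the paper):
# A[i, j] = fraction of lies generated by model i that detector j
# correctly identified, aggregated over all 2T1L rounds between them.
# Diagonal entries (self-play) are excluded from both scores.

def lie_score(A: np.ndarray, i: int) -> float:
    """How hard model i's lies are to detect: 1 minus the mean
    detection rate achieved against it by all other models."""
    others = np.delete(A[i, :], i)   # detection rates vs. model i's lies
    return 1.0 - float(others.mean())

def detect_score(A: np.ndarray, j: int) -> float:
    """How reliably model j identifies other models' lies: the mean
    detection rate it achieves across all other generators."""
    others = np.delete(A[:, j], j)   # model j's detection rates
    return float(others.mean())

# Toy 3-model interaction matrix (illustrative numbers only).
A = np.array([
    [0.00, 0.40, 0.55],   # model 0's lies fooled detectors often
    [0.80, 0.00, 0.75],   # model 1's lies were usually caught
    [0.60, 0.50, 0.00],
])

for m in range(A.shape[0]):
    print(f"model {m}: LieScore={lie_score(A, m):.2f}, "
          f"DetectScore={detect_score(A, m):.2f}")
```

Sorting models by either score then induces the tournament ranking the abstract refers to.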
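For the rank correlation across topics, one plausible realization, assuming each topic's tournament yields a per-model score vector, is Spearman's rho between topic-wise scores; the topic names and numbers below are hypothetical.

```python
from scipy.stats import spearmanr

# Hypothetical per-topic LieScores for four models (illustrative,
# not results from the paper).
scores_history = [0.62, 0.55, 0.48, 0.30]
scores_physics = [0.58, 0.60, 0.35, 0.28]

rho, p_value = spearmanr(scores_history, scores_physics)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
# A low rho would indicate domain-specific strengths: models whose lies
# are hard to detect in one topic are not necessarily hard to detect
# in another.
```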
Submission Number: 261