HugAgent: Evaluating LLMs in Simulating Individual-Level Human Reasoning on Open-Ended Tasks

Published: 23 Sept 2025, Last Modified: 22 Nov 2025 · LAW · CC BY 4.0
Keywords: human-grounded evaluation; theory of mind; belief attribution; belief updating; social reasoning; agent simulation; social simulation; semi-structured interviews; LAW framework; large language models; personalized context
TL;DR: We introduce HugAgent, a human-grounded benchmark for testing how agents infer and update beliefs in social contexts.
Abstract: Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, “out-loud” reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced at [HugAgent](https://github.com/jajamoa/HugAgent) and [TraceYourThinking](https://github.com/jajamoa/trace-your-thinking).
Submission Type: Benchmark Paper (4-9 Pages)
Submission Number: 132