HugAgent: Evaluating LLMs in Simulating Individual-Level Human Reasoning on Open-Ended Tasks

Published: 23 Sept 2025, Last Modified: 22 Nov 2025 · LAW · CC BY 4.0
Keywords: human-grounded evaluation; theory of mind; belief attribution; belief updating; social reasoning; agent simulation; social simulation; semi-structured interviews; LAW framework; large language models; personalized context
TL;DR: We introduce HugAgent, a human-grounded benchmark for testing how agents infer and update beliefs in social contexts.
Abstract: Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, “out-loud” reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced at [HugAgent](https://github.com/jajamoa/HugAgent) and [TraceYourThinking](https://github.com/jajamoa/trace-your-thinking).
Submission Type: Benchmark Paper (4-9 Pages)
Submission Number: 132