Keywords: Reinforcement Learning, Human Behavior Simulation, LLM
Abstract: Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction across various domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user’s persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: to what extent can LLM agents simulate personalized user behavior, i.e., predict a user’s next action given that user’s persona and interaction history, and how can this ability be enhanced? We introduce \projectname, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona by prepending a structured persona block to each step’s prompt, and we optimize next-step rationale and action generation via action-correctness reward signals. Experiments on the OPeRA dataset demonstrate that \projectname not only significantly outperforms prompting and SFT-based baselines on next-action prediction tasks, but also better matches individual users’ action distributions, indicating higher fidelity in personalized behavior simulation.
Submission Number: 164