ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Published: 24 Jan 2026 · Last Modified: 12 Jan 2026 · EACL 2026 · CC BY 4.0
Abstract: The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap": systems optimized against simulated interactions may fail in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to better understand this gap. Its unique dual-agent data collection protocol, which deploys both "good" and "bad" recommenders, enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven methods consistently outperform a prompted baseline, particularly in counterfactual validation, where they adapt more realistically to unseen behaviors, indicating a more robust, if imperfect, user model.