Abstract: In recommender systems, online A/B testing is a crucial method for evaluating the performance of different models. However, conducting online A/B testing often presents significant challenges, including substantial economic costs, degraded user experience, and considerable time requirements. With the powerful capabilities of Large Language Models (LLMs), LLM-based agents show great potential to replace traditional online A/B testing. Nonetheless, current agents fail to simulate the perception process and interaction patterns of real users, owing to the lack of realistic environments and visual perception capabilities. To address these challenges, we introduce a multimodal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing that enables multimodal, multi-page interactions aligned with real user behavior on online platforms. The designed agent perceives multimodal information, models fine-grained user preferences, and integrates profiles, action memory retrieval, and a fatigue system to simulate complex human decision-making. We validate the agent's potential as an alternative to traditional A/B testing from three perspectives: model, data, and features. Additionally, we find that the data generated by A/B Agent can effectively enhance the capabilities of recommendation models.
Our code is publicly available~\footnote{\url{https://anonymous.4open.science/r/MMAgent-D8E2/}}.
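As a rough illustration of the decision-making components named in the abstract (profile, action memory retrieval, fatigue system), the sketch below shows one way such an agent step could be structured. All names and the scoring rule are hypothetical stand-ins (the real system presumably drives the decision with an LLM over multimodal inputs); this is not taken from the released code.

```python
# Minimal sketch (hypothetical names) of combining a user profile, retrieved
# action memory, and a fatigue signal in a single simulated-user decision step.
# The hand-written score below stands in for the LLM-based judgment in the paper.
from dataclasses import dataclass, field

@dataclass
class UserAgent:
    profile: dict                                        # static preferences, e.g. {"genre": "sci-fi"}
    action_memory: list = field(default_factory=list)    # past (category, action) records
    fatigue: float = 0.0                                 # grows with exposure, suppresses engagement

    def retrieve_memory(self, item: dict, k: int = 3) -> list:
        """Return the k most recent past actions on items of the same category."""
        same_cat = [m for m in self.action_memory if m["category"] == item["category"]]
        return same_cat[-k:]

    def decide(self, item: dict) -> str:
        """Toy decision rule: preference match minus fatigue, informed by memory."""
        pref_match = 1.0 if item["category"] == self.profile.get("genre") else 0.2
        boredom = 0.1 * sum(1 for m in self.retrieve_memory(item) if m["action"] == "skip")
        action = "click" if (pref_match - self.fatigue - boredom) > 0.5 else "skip"
        # Fatigue accumulates with every impression and decays slightly after a click.
        self.fatigue = max(0.0, self.fatigue + 0.05 - (0.02 if action == "click" else 0.0))
        self.action_memory.append({"category": item["category"], "action": action})
        return action

agent = UserAgent(profile={"genre": "sci-fi"})
print(agent.decide({"category": "sci-fi", "title": "Dune"}))    # likely "click"
print(agent.decide({"category": "cooking", "title": "Pasta"}))  # likely "skip"
```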
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: multimodal applications
Contribution Types: NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 6121