Keywords: Multimodal LLMs, next purchase prediction, recommendation system, personalization, reasoning efficiency
TL;DR: We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across MLLMs.
Abstract: Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user-behavior data are more effective for maximizing MLLM performance remains underexplored. We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs, representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. Using a real-world purchase-sequence dataset, we find that when the data are represented as images, MLLMs' next-purchase prediction accuracy improves by 87.5\% compared with an equivalent textual representation, at no additional computational cost.
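To make the modality comparison concrete, the sketch below shows how a single purchase sequence might be rendered as two of the three representations named in the abstract: a text paragraph and a scatter-plot image. This is a minimal illustration under assumed field names and plotting choices, not the authors' BehaviorLens implementation; the flowchart variant and the actual MLLM prompting are omitted.

```python
# Illustrative sketch (assumptions, not the authors' code): build a text-paragraph
# and a scatter-plot representation of a toy purchase sequence.
from dataclasses import dataclass
from datetime import date
import matplotlib.pyplot as plt


@dataclass
class Purchase:
    day: date        # purchase date
    category: str    # item category
    price: float     # price in USD

# Hypothetical purchase history for one user (fabricated for illustration).
history = [
    Purchase(date(2024, 1, 3), "running shoes", 89.0),
    Purchase(date(2024, 2, 14), "sports socks", 12.0),
    Purchase(date(2024, 3, 9), "fitness tracker", 149.0),
]


def as_text(purchases: list[Purchase]) -> str:
    """Modality (1): render the sequence as a plain-text paragraph."""
    sentences = [
        f"On {p.day.isoformat()} the user bought {p.category} for ${p.price:.0f}."
        for p in purchases
    ]
    return " ".join(sentences)


def as_scatter_plot(purchases: list[Purchase], path: str = "behavior.png") -> str:
    """Modality (2): render the sequence as a scatter plot (date vs. price) and save it."""
    fig, ax = plt.subplots()
    ax.scatter([p.day for p in purchases], [p.price for p in purchases])
    for p in purchases:
        ax.annotate(p.category, (p.day, p.price))
    ax.set_xlabel("purchase date")
    ax.set_ylabel("price (USD)")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
    return path


if __name__ == "__main__":
    print(as_text(history))           # text representation for a text-only prompt
    print(as_scatter_plot(history))   # image representation attached to a multimodal prompt
    # Either representation would then be paired with a next-purchase question
    # and sent to an MLLM; only the modality of the behavior data differs.
```

Because both representations encode the same transactions, any accuracy gap between the text and image prompts can be attributed to the modality rather than the information content.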
Submission Number: 274