Keywords: user-modeling, temporal-generalization, user-recommendation, sequential-recommendation, long-horizon, ood-evaluation
TL;DR: HORIZON establishes robust evaluation practices for user behavior modeling, emphasizing temporal generalization alongside cross-domain and unseen-user adaptation.
Abstract: User-interaction sequences in modern recommendation systems often exhibit complex temporal dynamics and evolving preferences, presenting challenges for reliable evaluation. Most existing benchmarks focus on short-term, single-domain prediction and use in-distribution splits, which fail to test temporal and cross-user generalization. Standard evaluation practices often rely on leave-one-out or ratio-based splits, leading to temporal leakage and rewarding models that exploit short-range correlations rather than capturing true behavioral evolution. We introduce HORIZON, a large-scale benchmark designed to establish robust evaluation practices for sequential recommendation and user behavior modeling. Built as a cross-domain reformulation of the Amazon Reviews dataset, it covers 54M users, 35M items, and 486M interactions, supporting both pre-training and rigorous out-of-distribution evaluation. HORIZON enables systematic evaluation of three critical capabilities: (i) long-term temporal generalization as user preferences naturally shift and mature over time, (ii) cross-domain transfer reflecting users' expanding and diversifying interests, and (iii) cold-start generalization capturing behavioral patterns that emerge with new users. Our results demonstrate that while established baselines (e.g., BERT4Rec) perform well under conventional evaluation, they degrade significantly under temporal and unseen-user scenarios, and even state-of-the-art LLMs struggle on this task, highlighting the gap between current models and the complex temporal, cross-domain nature of real-world user behavior.
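The leakage argument in the abstract is easy to make concrete. Below is a minimal sketch, not taken from the paper: the names (`interactions`, `split_leave_one_out`, `split_temporal`) are hypothetical, and it assumes interactions are simple (user, item, timestamp) triples. It shows how a per-user leave-one-out split lets training events post-date test events, while a global temporal split does not.

```python
# Hypothetical sketch (not HORIZON's actual API): contrasts the per-user
# leave-one-out split the abstract criticizes with a global temporal split.
from collections import defaultdict

# Toy interaction log: (user_id, item_id, timestamp).
interactions = [
    ("u1", "i1", 10), ("u1", "i2", 50), ("u1", "i3", 90),    # u1's last event: t=90
    ("u2", "i2", 20), ("u2", "i4", 95), ("u2", "i5", 120),   # u2's last event: t=120
]

def split_leave_one_out(log):
    """Hold out each user's final interaction as the test event.

    Because the holdout is per-user, one user's *training* events can
    post-date another user's *test* event: the temporal leakage the
    abstract describes.
    """
    by_user = defaultdict(list)
    for event in sorted(log, key=lambda e: e[2]):
        by_user[event[0]].append(event)
    train, test = [], []
    for events in by_user.values():
        train.extend(events[:-1])
        test.append(events[-1])
    return train, test

def split_temporal(log, cutoff):
    """Global chronological split: all training events strictly precede
    all test events, so evaluation probes long-horizon generalization."""
    train = [e for e in log if e[2] < cutoff]
    test = [e for e in log if e[2] >= cutoff]
    return train, test

loo_train, loo_test = split_leave_one_out(interactions)
# u2's training event at t=95 post-dates u1's test event at t=90: leakage.
print(max(t for _, _, t in loo_train) > min(t for _, _, t in loo_test))  # True

tmp_train, tmp_test = split_temporal(interactions, cutoff=85)
print(max(t for _, _, t in tmp_train) < min(t for _, _, t in tmp_test))  # True
```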
Submission Number: 200