Keywords: user-modeling, temporal-generalization, user-recommendation, sequential-recommendation, long-horizon, ood-evaluation
TL;DR: HORIZON establishes robust evaluation practices for user behavior modeling, emphasizing temporal generalization alongside cross-domain and unseen-user adaptation.
Abstract: User-interaction sequences in modern recommendation systems often exhibit complex temporal patterns that pose fundamental challenges for time series modeling. However, existing user modeling approaches rely on benchmarks mostly focused on short-term, single-domain next-item prediction, employing in-distribution evaluation practices that fail to assess true temporal and cross-user OOD generalization capabilities. Furthermore, most benchmarks use leave-one-out or ratio-based splits that risk temporal leakage and reward models that exploit short-range correlations rather than capturing evolving user preferences. We introduce HORIZON, a large-scale benchmark designed to establish robust evaluation practices for sequential recommendation and user behavior modeling. Built as a cross-domain reformulation of the Amazon Reviews dataset, it covers 54M users, 35M items, and 486M interactions, enabling both pre-training and rigorous out-of-distribution evaluation. HORIZON tests three core capabilities essential for real-world deployment: (i) long-term temporal generalization, (ii) cross-domain transfer, and (iii) unseen-user generalization for cold-start settings. Our results demonstrate that while traditional baselines (e.g., BERT4Rec) perform well in-domain, they degrade significantly under temporal and unseen-user scenarios, and even state-of-the-art LLMs struggle on this task. Our findings underscore the gap between current models and the complex temporal, cross-domain nature of real-world user behavior.
Submission Number: 29