Keywords: information access, AI agent, user simulation
TL;DR: We formally define Just-in-time Information Recommendation (JIR), introduce JIR-Arena—the first large-scale multimodal benchmark for JIR systems—and provide baseline evaluations, code, and data to enable future development in this emerging area.
Abstract: Just-in-time Information Recommendation (JIR) is a service that delivers the most relevant information precisely when users need it most. It plays a critical role in filling users' information gaps during pivotal moments in learning, work, and social interactions, thereby improving decision-making quality and everyday efficiency with minimal user effort. Recent advances in the device-efficient deployment of performant foundation models and the proliferation of intelligent wearable devices have made always-on JIR assistants feasible. However, despite the potential of JIR systems to transform our daily lives, there has been little prior systematic effort to formally define JIR tasks, establish evaluation frameworks, or build a large-scale multimodal benchmark with high-quality, multi-party-sourced ground-truth labels. To bridge this gap, we present a comprehensive mathematical definition of JIR tasks and their associated evaluation metrics. Furthermore, we introduce JIR-Arena, the first multimodal JIR benchmark dataset with diverse, information-request-intensive scenarios, designed to evaluate JIR systems along multiple dimensions: whether they can i) accurately infer user information needs, ii) provide timely and relevant recommendations, and iii) avoid including irrelevant content that might distract users.
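The formal metric definitions appear in the full paper; as a rough illustration only, the sketch below shows one way the evaluation dimensions above could be operationalized as time-windowed matching between predicted and ground-truth information needs. The class names, the lexical relevance check, and the 10-second window are our own assumptions, not JIR-Arena's actual specification.

```python
# Hypothetical sketch of time-windowed matching between predicted and
# ground-truth information needs; names, thresholds, and the matching rule
# are illustrative, not the metric definitions used by JIR-Arena.
from dataclasses import dataclass

@dataclass
class Need:
    text: str   # description of the information need
    t: float    # timestamp (seconds) at which the need arises

def matches(pred: Need, ref: Need, window: float = 10.0) -> bool:
    """A prediction counts only if it is both relevant and timely."""
    relevant = (pred.text.lower() in ref.text.lower()
                or ref.text.lower() in pred.text.lower())  # crude lexical proxy
    timely = abs(pred.t - ref.t) <= window
    return relevant and timely

def evaluate(preds: list[Need], refs: list[Need]) -> dict[str, float]:
    matched_refs = {i for i, r in enumerate(refs) if any(matches(p, r) for p in preds)}
    matched_preds = {i for i, p in enumerate(preds) if any(matches(p, r) for r in refs)}
    precision = len(matched_preds) / len(preds) if preds else 0.0  # avoid distracting content
    recall = len(matched_refs) / len(refs) if refs else 0.0        # cover true needs
    return {"precision": precision, "recall": recall}
```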
Constructing a JIR benchmark is challenging because user information needs are subjective and evaluations are hard to reproduce. To overcome these challenges, our benchmark approximates the distribution of user needs by combining human and large AI model inputs, and enhances objectivity through a multi-turn validation framework. Additionally, we ensure that assessments are reproducible by evaluating information recommendation outcomes against static knowledge bases. We also develop a baseline JIR system architecture and instantiate it with several large foundation models. Our evaluation of these baselines on JIR-Arena reveals that while foundation model-based JIR systems can simulate user needs with reasonable precision, they struggle with recall and effective content retrieval. Finally, to facilitate the future development of JIR systems and the exploration of more JIR application scenarios, we release our code and data in the supplementary materials.
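As a rough sketch of what a baseline JIR architecture of this kind might look like, the snippet below chains a pluggable need-inference step (standing in for a foundation model) with a simple retrieval step over a static knowledge base. The function names, the token-overlap scorer, and the stub inference function are hypothetical placeholders, not the system released with the paper.

```python
# Hypothetical baseline JIR pipeline: infer information needs from the
# incoming context stream, then retrieve supporting content from a static
# knowledge base. Component names and the scoring rule are assumptions.
from typing import Callable

def token_overlap(query: str, doc: str) -> float:
    """Crude lexical relevance score; a real system would likely use a dense retriever."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def recommend(context_chunk: str,
              infer_needs: Callable[[str], list[str]],
              knowledge_base: list[str],
              top_k: int = 1) -> list[tuple[str, str]]:
    """For each inferred need, attach the best-matching knowledge-base snippet."""
    results = []
    for need in infer_needs(context_chunk):
        ranked = sorted(knowledge_base,
                        key=lambda doc: token_overlap(need, doc),
                        reverse=True)
        results.extend((need, doc) for doc in ranked[:top_k])
    return results

# Example usage with a stub need-inference function standing in for a foundation model.
if __name__ == "__main__":
    kb = ["Transformers use self-attention over token sequences.",
          "Beam search is a heuristic decoding strategy."]
    stub_infer = lambda chunk: ["what is self-attention"] if "attention" in chunk else []
    print(recommend("the lecturer mentions attention layers", stub_infer, kb))
```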
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 22608