Shared Contexts, Personalized Outputs: A Benchmark for Document Generation

ICLR 2026 Conference Submission13426 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Dataset and Benchmark, Document Generation, Personalization, Long-context, Natural Language Generation, Large Language Models
Abstract: Large Language Models (LLMs) have recently demonstrated strong capabilities in long-context text generation, enabling applications such as meeting summarization and multi-document question answering. However, these tasks typically focus on producing a single, context-consistent output, without accounting for user-specific roles, preferences, or intents. Personalized Contextual Document Generation (PCDG) requires models to generate distinct, user-tailored documents grounded in the same extended context. Generating user-tailored outputs is key to adaptive applications, reducing manual edits and improving downstream utility, yet this capability remains underexplored due to the difficulty of evaluation. Furthermore, benchmarking PCDG effectively demands realistic, controllable context modeling, explicit personalization signals, well-defined intermediate sub-tasks, and evaluation metrics that go beyond surface-level similarity. To this end, we present PersonaContextWeaver, a benchmarking framework designed to meet these requirements through three key innovations: (1) a knowledge-graph-based synthesis pipeline that generates rich, multi-user, cross-domain conversational contexts with controllable personalization variables; (2) a task decomposition strategy that evaluates not only final document quality but also intermediate reasoning steps, including intent detection, context filtering, and reference prediction; and (3) a multi-dimensional evaluation protocol that assesses an LLM's ability to understand user intents and profiles, retrieve relevant context, and customize documents. Empirical evaluation of state-of-the-art LLMs on PersonaContextWeaver reveals substantial gaps in their ability to consistently generate highly personalized, contextually accurate documents. Models often struggle with nuanced user modeling, context filtering, and reference integration, indicating that personalized contextual document generation remains a challenging frontier for current LLMs.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13426