Abstract: Integrating heterogeneous modalities for effective information access remains a central challenge in Information Retrieval (IR), particularly in reader-aware summarization, where user perspectives must be incorporated alongside textual and multimedia content. In this work, we present a novel augmentation framework that combines the strengths of Language Models (LMs) and multimodal models to generate holistic news summaries. Our approach integrates textual articles, visual evidence from images, user-generated comments, and insights distilled from video streams. Through extensive experiments, we show that this LM-ensembled multimodal framework consistently surpasses specialized Video Language Models (Video LMs) in coherence, informativeness, and user sensitivity across multiple benchmarks. To further advance multimodal IR research, we extend the Reader-Aware Multi-Document Summarization (RAMDS) dataset with video components, introducing VARAMDS (Video-Augmented-RAMDS), the first resource to explicitly couple news text, imagery, reader comments, and video content. Our findings demonstrate that LM-driven augmentation not only improves multimodal summarization quality but also sets a new standard for reader-aware, comment-sensitive synthesis, bridging gaps between heterogeneous information sources and supporting richer retrieval-oriented applications in resource-constrained environments. The dataset is available at: https://github.com/Raghvendra-14/VARAMDS.