Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers

Published: 24 Oct 2025, Last Modified: 24 Oct 2025
Accepted by TMLR (CC BY 4.0)
Abstract: Document summarization facilitates efficient identification and assimilation of user-relevant content, a process inherently influenced by individual subjectivity. Discerning $\textit{subjective}$ salient information within a document, particularly when it has multiple facets, poses significant challenges. This complexity underscores the necessity of $\textit{personalized summarization}$. However, training models for personalized summarization has so far been difficult, chiefly because diverse training data containing both user preference history (i.e., $\textit{click-skip}$ trajectories) and expected (gold-reference) summaries are scarce. The MS/CAS PENS dataset is a rare resource in this direction; however, its training split contains only preference histories $\textit{without any target summaries}$, blocking end-to-end supervised learning. Moreover, the diversity of topic transitions along its trajectories is relatively low, leaving room for better generalization. To address this, we propose PerAugy, a novel $\textit{cross-trajectory shuffling}$ and $\textit{summary-content perturbation}$ based data augmentation technique that significantly boosts the accuracy of four state-of-the-art (SOTA) baseline user-encoders commonly used in personalized summarization frameworks (best result: $0.132\uparrow$ w.r.t. AUC). We select two such SOTA summarizer frameworks as baselines and observe that, when equipped with their correspondingly improved user-encoders, they consistently show an increase in personalization (avg. boost: $61.2\%\uparrow$ w.r.t. the PSE-SU4 metric). As a post-hoc analysis of the role of the diversity induced by PerAugy in the augmented dataset, we introduce three dataset diversity metrics -- $\mathrm{TP}$, $\mathrm{RTC}$, and $\mathrm{DegreeD}$ -- to quantify it.
We find that $\mathrm{TP}$ and $\mathrm{DegreeD}$ correlate strongly with user-encoder performance when trained on the PerAugy-generated dataset across all accuracy metrics, indicating that the increase in dataset diversity plays a major role in the performance gain.
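To make the idea of cross-trajectory shuffling concrete, the following is a minimal illustrative sketch in Python: segments from one user's click-skip trajectory are spliced onto another's, which increases the variety of topic transitions seen during training. This is a hypothetical simplification for intuition only, not the actual PerAugy algorithm; the function name, data layout, and cut-point policy are all assumptions.

```python
import random


def cross_trajectory_shuffle(trajectories, seed=0):
    """Illustrative sketch (NOT the paper's PerAugy algorithm):
    splice a prefix of each trajectory onto the suffix of a randomly
    paired trajectory, producing new synthetic preference histories
    with more diverse topic transitions.

    `trajectories` is a list of trajectories; each trajectory is a
    list of (doc_id, clicked) pairs from a user's click-skip history.
    """
    rng = random.Random(seed)
    partners = trajectories[:]
    rng.shuffle(partners)  # pair each trajectory with a random partner

    augmented = []
    for a, b in zip(trajectories, partners):
        # Pick cut points inside each trajectory (keep at least one item).
        cut_a = rng.randrange(1, len(a)) if len(a) > 1 else 1
        cut_b = rng.randrange(1, len(b)) if len(b) > 1 else 0
        # Synthetic trajectory: prefix of `a` followed by suffix of `b`.
        augmented.append(a[:cut_a] + b[cut_b:])
    return augmented


# Usage: two toy click-skip trajectories -> two augmented ones.
trajs = [
    [("d1", True), ("d2", False), ("d3", True)],
    [("d4", False), ("d5", True), ("d6", False)],
]
aug = cross_trajectory_shuffle(trajs, seed=0)
```

Every item in an augmented trajectory comes from one of the original trajectories, so interaction labels stay valid while the sequence of topics a user appears to traverse becomes more varied.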
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=GYNFv93NbJ
Changes Since Last Submission: Dear Action Editors and Reviewers, we thank you all for your invaluable feedback and timely support. Your comments have gone a long way toward making the paper far more structured and comprehensive. We are now submitting the camera-ready version along with links to the code and a video presentation. We are actively continuing this thread of research and hope to communicate further improvements. Best
Video: https://tinyurl.com/PerAugy-KDMLab
Code: https://github.com/KDM-LAB/PerAugy-TMLR/
Supplementary Material: zip
Assigned Action Editor: ~Branislav_Kveton1
Submission Number: 5306