Abstract: Document summarization facilitates efficient identification and assimilation of user-relevant content, a process inherently shaped by individual subjectivity. Discerning $\textit{subjectively}$ salient information within a document, particularly when the document has multiple facets, poses significant challenges. This complexity underscores the need for $\textit{personalized summarization}$. However, training models for personalized summarization has so far been difficult, chiefly because diverse training data containing both user preference history (i.e., $\textit{click-skip}$ trajectories) and expected (gold-reference) summaries are scarce. The MS/CAS PENS dataset is a rare resource in this direction; however, its training split contains only preference history $\textit{without any target summaries}$, which blocks end-to-end supervised learning. Moreover, the diversity of topic transitions along the trajectories is relatively low, leaving scope for better generalization. To address this, we propose PerAugy, a novel $\textit{cross-trajectory shuffling}$ and $\textit{summary-content perturbation}$ based data augmentation technique that significantly boosts the accuracy of four state-of-the-art (SOTA) baseline user-encoders commonly used in personalized summarization frameworks (best result: $0.132\uparrow$ w.r.t. AUC). We select two such SOTA summarization frameworks as baselines and observe that, when equipped with their correspondingly improved user-encoders, they consistently show an increase in personalization (avg. boost: $61.2\%\uparrow$ w.r.t. the PSE-SU4 metric). As a post-hoc analysis of the diversity PerAugy induces in the augmented dataset, we introduce three dataset diversity metrics, $\mathrm{TP}$, $\mathrm{RTC}$, and $\mathrm{DegreeD}$, to quantify it. We find that $\mathrm{TP}$ and $\mathrm{DegreeD}$ correlate strongly with user-encoder performance across all accuracy metrics when the encoders are trained on the PerAugy-generated dataset, indicating that the increase in dataset diversity plays a major role in the performance gain.
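For intuition, here is a minimal sketch of what such an augmentation could look like. This is not PerAugy itself (the exact procedure is specified in the paper); the function names, the random cut-point choice, and the token-dropping perturbation below are illustrative assumptions.

```python
import random

# Illustrative sketch only; the actual PerAugy algorithm may differ.
# A "trajectory" is assumed here to be a list of (document_id, action)
# pairs, where action is "click" or "skip".

def cross_trajectory_shuffle(traj_a, traj_b, rng=random):
    """Swap the suffixes of two user trajectories at a random cut point,
    injecting novel topic transitions into each (assumed mechanism)."""
    cut = rng.randint(1, min(len(traj_a), len(traj_b)) - 1)
    return traj_a[:cut] + traj_b[cut:], traj_b[:cut] + traj_a[cut:]

def perturb_summary(summary_tokens, drop_prob=0.1, rng=random):
    """Randomly drop tokens from a summary as one simple instantiation
    of summary-content perturbation."""
    kept = [t for t in summary_tokens if rng.random() > drop_prob]
    return kept or summary_tokens  # never return an empty summary

# Usage on two toy click-skip trajectories:
u1 = [("d1", "click"), ("d2", "skip"), ("d3", "click")]
u2 = [("d7", "click"), ("d8", "click"), ("d9", "skip")]
aug1, aug2 = cross_trajectory_shuffle(u1, u2)
print(aug1, aug2)
print(perturb_summary("markets rallied after the rate cut".split()))
```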
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=GYNFv93NbJ
Changes Since Last Submission: Dear Action Editors and Reviewers,
We thank you all for your thoughtful and constructive feedback. We have responded point-by-point to every reviewer concern (see "PerAugy_Review-Response_Final.pdf" in the attached supplementary zip file).
**The main file is color-coded as follows**:
- Common concerns -- blue
- Reviewer qsyo -- teal
- Reviewer hXxr -- dark orange
- Reviewer si1k -- dark green
**The supplementary zip file contains**:
1. **PerAugy_Review-Response_Final.pdf** -- Responses to every review comment. Each reviewer concern appears in *red italics*, followed by our response in black; *blue italicized excerpts* are verbatim passages from the revised manuscript.
2. **PerAugy_Marked-Revision_Final.pdf** -- Copy of the revised manuscript, with additions and modifications marked in *blue*.
3. **PerAugy Codebase Revised.zip** -- Modified codebase, including code for the added experiments.
Please feel free to reach out with any queries. Thanks again for the invaluable feedback; it will go a long way toward improving our research. We hope for positive engagement now that we have provided the clarifications and additional experimental results.
Best
Assigned Action Editor: ~Branislav_Kveton1
Submission Number: 5306