Privacy-Preserving Document Summarization via Sensitive-Word Detection and Masked Content Reconstruction

ACL ARR 2026 January Submission9963 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Privacy-Preserving NLP, Document Summarization, Sensitive Token Detection

Abstract: Identifying and protecting sensitive information in documents is crucial when using cloud-based summarization services. We introduce a privacy-preserving document summarization framework that detects potentially sensitive words and masks them before the document leaves the user environment. A custom classification model, trained on a rich feature set, recognizes sensitive terms in the original (plain) text, and each detected word is replaced with a mask token before the text is sent to a cloud-based summarization model. The summary of the masked text is then reconstructed in the user environment by aligning each mask token with its original word using context windows and semantic similarity. Experiments on a labeled dataset of sensitive documents show that our method correctly recovers 93.87% of the masked content in the AI-generated summaries, demonstrating the effectiveness of the proposed masking and reconstruction strategy. Moreover, when as much as 50% of sensitive words are masked, the ROUGE-1 score of multiple summarization models decreases by only 20% on average, and BERTScore drops by about 0.04, relative to summaries of the original text.
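The mask-then-reconstruct flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the trained classifier is replaced by a fixed set of sensitive words, semantic similarity is replaced by Jaccard overlap of context-window words, the cloud summarizer is replaced by a hand-written masked summary, and all function names are hypothetical.

```python
import re

MASK = "[MASK]"

def tokenize(text):
    # Split into mask tokens, words, and single punctuation marks.
    return re.findall(r"\[MASK\]|\w+|[^\w\s]", text)

def context(tokens, i, k):
    # Lower-cased tokens within k positions of index i, ignoring other masks.
    window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
    return {t.lower() for t in window if t != MASK}

def mask_document(text, sensitive, k=3):
    # Assumption: `sensitive` stands in for the output of a trained detector.
    tokens = tokenize(text)
    masked = [MASK if t.lower() in sensitive else t for t in tokens]
    # Keep each original word with its context window; this stays on the user side.
    records = [(tokens[i], context(masked, i, k))
               for i, t in enumerate(masked) if t == MASK]
    return " ".join(masked), records

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def reconstruct(summary, records, k=3):
    # Replace each mask with the stored word whose context overlaps most.
    # Jaccard overlap is a simple stand-in for semantic similarity.
    tokens = tokenize(summary)
    out = []
    for i, t in enumerate(tokens):
        if t == MASK and records:
            ctx = context(tokens, i, k)
            word, _ = max(records, key=lambda r: jaccard(ctx, r[1]))
            out.append(word)
        else:
            out.append(t)
    return " ".join(out)

masked_text, records = mask_document(
    "Alice paid 500 dollars to Bob for the car", {"alice", "bob"})
# Hand-written stand-in for the summary a cloud model might return.
masked_summary = "[MASK] paid [MASK] for the car"
print(masked_text)
print(reconstruct(masked_summary, records))
```

In this sketch only `masked_text` would be sent to the remote service, while `records` never leaves the user environment. A realistic system would need a stronger similarity measure, since abstractive summaries can rephrase the context around a mask.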

Paper Type: Long

Research Area: Summarization

Research Area Keywords: extractive summarisation; abstractive summarisation; evaluation

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources

Languages Studied: English

Submission Number: 9963
