Privacy-Preserving Document Summarization via Sensitive-Word Detection and Masked Content Reconstruction
Keywords: Privacy-Preserving NLP, Document Summarization, Sensitive Token Detection
Abstract: Identifying and protecting sensitive information in documents is crucial when using cloud-based summarization services. We introduce a privacy-preserving document summarization framework that first detects potentially sensitive words and then masks them before the document is submitted to the AI server. A custom classification model, trained on a rich feature set, recognizes the sensitive terms.
In the proposed pipeline, the sensitive words detected in the original (plain) text are replaced with mask tokens before the text is sent to a cloud-based summarization model. The summary of the masked text is then reconstructed in the user's environment by aligning each mask token with its original word using context windows and semantic similarity.
Experiments on a labeled dataset of sensitive documents show that, with our method, the user can correctly recover 93.87% of the masked content in the AI-generated summaries, demonstrating the effectiveness of the proposed masking and reconstruction strategy.
Moreover, when as much as 50% of the sensitive words are masked, the ROUGE-1 quality of multiple summarization models shows an average decrease of only 20% and a drop of about 0.04 in BERTScore compared to summaries of the original text.
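The mask-then-reconstruct loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: the hypothetical `sensitive` word set stands in for the trained classifier, the `[MASK]` string stands in for whatever mask token the paper uses, and Jaccard overlap of neighbouring words stands in for the semantic similarity measure, which the abstract does not specify.

```python
import re

MASK = "[MASK]"  # assumed mask token; the abstract does not name one


def tokenize(text):
    # Split into mask tokens, words, and single punctuation marks.
    return re.findall(r"\[MASK\]|\w+|[^\w\s]", text)


def context(tokens, i, window):
    # Lower-cased neighbours of position i, ignoring other mask tokens.
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return {t.lower() for t in left + right if t != MASK}


def mask_document(text, sensitive, window=2):
    # `sensitive` is a hypothetical stand-in for the classifier's output.
    tokens = tokenize(text)
    masked = [MASK if t.lower() in sensitive else t for t in tokens]
    # Keep (original word, context window) locally; only `masked` leaves the user.
    records = [(tokens[i], context(masked, i, window))
               for i, t in enumerate(masked) if t == MASK]
    return " ".join(masked), records


def jaccard(a, b):
    # Token-overlap similarity, used here in place of semantic similarity.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reconstruct(summary, records, window=2):
    # Replace each mask with the stored word whose context matches best.
    tokens = tokenize(summary)
    out = list(tokens)
    for i, t in enumerate(tokens):
        if t == MASK and records:
            ctx = context(tokens, i, window)
            word, _ = max(records, key=lambda r: jaccard(ctx, r[1]))
            out[i] = word
    return " ".join(out)  # note: whitespace is normalised around punctuation


document = "Alice was treated at Mercy Hospital for diabetes in March."
masked_text, records = mask_document(document, {"alice", "diabetes"})
# Pretend the cloud model returned this summary of the masked text.
summary = "[MASK] was treated for [MASK] in March."
restored = reconstruct(summary, records)
```

Only `masked_text` would be sent to the remote model; `records` stays in the user environment. A real system would presumably replace `jaccard` with an embedding-based similarity and handle masks that the summarizer drops or duplicates.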
Paper Type: Long
Research Area: Summarization
Research Area Keywords: extractive summarisation; abstractive summarisation; evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 9963