Multi-document Summarization in Medical Literature Using a PICO-Masking Approach


Abstract: Multi-document summarization is essential for capturing key information from the vast medical literature. Datasets in this domain typically comprise triples of a background, a set of documents, and a summary, where the background describes the clinical research question or topic shared by the related documents. To summarize based on a background while accommodating multiple documents, existing approaches typically shorten the input through truncation, disregarding potentially summary-relevant information. Others adopt extract-then-generate approaches at the document level or the sentence level, which can struggle to capture the relevant evidence because document-level extraction is excessively broad, while sentence-level extraction is overly granular and noisy. To address these problems, we combine the two extraction levels and frame the task as query-focused summarization, in which the background serves as the query. Specifically, we decompose the problem into two stages: 1) \textit{relevant evidence extraction} (i.e., finding relevant evidence within a set of relevant documents with regard to the shared background) and 2) \textit{summary generation} (i.e., generating summaries based on the relevant evidence). To represent the background as a query, we introduce a PICO-masking approach that masks the given background and treats the result as a \textit{proxy query} for our extraction model. In particular, PICO-masking masks the PICO elements, a mnemonic for the important parts of a well-built clinical question. This forces the extraction model to understand the context in order to identify the evidence in the documents that belongs to the masked background, and hence helps locate relevant evidence before a summary is generated. Results show that our approach achieves state-of-the-art performance on the MS2 dataset despite having multiple stages.
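
To make the masking step concrete, the following is a minimal sketch of how a background could be turned into a proxy query. It is an illustration under stated assumptions, not the implementation used in this work: it assumes the PICO spans are already available (how they are detected is not specified here), and the `[MASK]` token, the helper names `find_span` and `pico_mask`, and the example background are all hypothetical.

```python
from typing import List, Tuple

# (start, end, label) with end exclusive; the label names a PICO element.
Span = Tuple[int, int, str]


def find_span(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of `phrase` in `text` (hypothetical helper)."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"phrase not found: {phrase!r}")
    return (start, start + len(phrase), label)


def pico_mask(background: str, spans: List[Span], mask_token: str = "[MASK]") -> str:
    """Replace every PICO span in `background` with `mask_token`.

    Assumption: spans do not overlap. The mask token is a placeholder,
    not necessarily the one used by any particular model.
    """
    pieces = []
    cursor = 0
    for start, end, _label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported")
        pieces.append(background[cursor:start])  # keep unmasked context
        pieces.append(mask_token)                # hide the PICO element
        cursor = end
    pieces.append(background[cursor:])
    return "".join(pieces)


# Invented example background, for illustration only.
background = "Does aspirin reduce stroke risk in adults with diabetes?"
spans = [
    find_span(background, "adults with diabetes", "Population"),
    find_span(background, "aspirin", "Intervention"),
    find_span(background, "stroke risk", "Outcome"),
]
proxy_query = pico_mask(background, spans)
print(proxy_query)
```

In this sketch the masked string plays the role of the proxy query: an extraction model would then have to rely on the remaining context to match evidence in the documents to the hidden elements.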
