Abstract: Extractive automatic summarization methods can quickly and efficiently generate summaries by scoring sentences, extracting them, and eliminating redundancy. Most current extractive methods use deep learning to treat summarization as a sentence-level binary classification task. However, for long Chinese texts their effectiveness is limited by the model's maximum input length, and they require large amounts of training data. This paper proposes an unsupervised extractive summarization method that addresses the long-text encoding problem by incorporating contextual semantics into sentence-level encodings. First, we obtain semantic representations of sentences with the RoBERTa model. Second, we propose an improved k-Means algorithm to cluster the sentence representations. By defining sparse and dense clusters, we improve the accuracy of summary-sentence selection while preserving as much semantic information from the original text as possible. Experimental results on the CAIL2020 dataset show that our method outperforms the baselines by 6.64/7.68/7.14% on ROUGE-1/2/L, respectively. Moreover, by adding domain rules tailored to the dataset's characteristics, we further improve the results by 4.5/5.36/3.24%.
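The sketch below illustrates the general pipeline the abstract describes (RoBERTa sentence embeddings clustered with k-Means, keeping the sentence nearest each centroid as a summary candidate). It is a minimal baseline sketch, not the paper's method: the specific checkpoint, mean pooling, and standard k-Means are assumptions, and the improved k-Means with sparse/dense clusters and the domain rules are not reproduced here.

```python
# Minimal sketch: RoBERTa sentence embeddings + k-Means extractive selection.
# Assumptions: Chinese RoBERTa checkpoint "hfl/chinese-roberta-wwm-ext",
# mean pooling, vanilla sklearn KMeans (not the paper's improved variant).
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "hfl/chinese-roberta-wwm-ext"  # assumed model name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def encode_sentences(sentences):
    """Mean-pool RoBERTa's last hidden states into one vector per sentence."""
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, truncation=True,
                          max_length=128, return_tensors="pt")
        hidden = model(**batch).last_hidden_state       # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)   # masked mean pooling
    return pooled.numpy()

def extract_summary(sentences, n_clusters=3):
    """Cluster sentence embeddings; keep the sentence closest to each centroid."""
    embeddings = encode_sentences(sentences)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    picked = set()
    for center in km.cluster_centers_:
        picked.add(int(np.linalg.norm(embeddings - center, axis=1).argmin()))
    return [sentences[i] for i in sorted(picked)]       # restore document order

if __name__ == "__main__":
    doc = ["原告与被告于2010年登记结婚。", "婚后双方因家庭琐事多次发生争执。",
           "原告请求依法判令双方离婚。", "被告辩称夫妻感情尚未破裂。"]
    print(extract_summary(doc, n_clusters=2))
```

Because each sentence is encoded and clustered independently of the document-level input limit, this style of pipeline sidesteps the maximum-input-length constraint that the abstract identifies for long Chinese texts.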