- TL;DR: Multi-document extractive summaries are obtained using supervised learning algorithms on the well-known DUC 2002 corpus.
- Keywords: extractive summarization, supervised learning, embeddings
- Abstract: In this work, multi-document extractive summaries are obtained using supervised learning algorithms. The methodology has three steps: a pre-processing step, which filters irrelevant words and reduces the vocabulary through stemming; a representation step, which transforms sentences into vectors; and a classification step, which selects sentences for the summary. We used pre-trained sentence embeddings, a naïve Bayes classifier, and artificial neural networks. The classifiers were evaluated with recall, precision, accuracy, and F1 score, and the quality of the summaries with the ROUGE-n measure. The classifiers scored above 70% in accuracy, precision, recall, and F1 score, while the summary quality surpassed the state of the art. Nevertheless, classifier performance is not directly related to summary quality, because the two are measured on different bases: the classifiers operate on sentence embeddings, whereas summary quality is measured by word overlap. It is important to highlight that n-gram overlap is the basis of intrinsic summary quality measures such as ROUGE-n, which matters when comparing results across different works in the state of the art. We believe that combining word embeddings with n-grams as inputs to the classifiers is an interesting direction for further research.
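
Since the abstract rests on the point that ROUGE-n is built on n-gram overlap, a minimal sketch of the recall form of that measure may make the distinction concrete. It is not the evaluation code used in this work: the function names `ngrams` and `rouge_n_recall`, the lowercasing, the whitespace tokenization, and the use of a single reference summary are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Recall-oriented n-gram overlap between a candidate and one reference.

    Assumption: text is lowercased and split on whitespace; no stemming
    or stop-word removal is applied here.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: each reference n-gram is matched at most as many
    # times as it appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print(rouge_n_recall(candidate, reference, n=1))
    print(rouge_n_recall(candidate, reference, n=2))
```

Because the score depends only on shared surface n-grams, a sentence that a classifier labels correctly in embedding space can still contribute little to ROUGE-n if its wording differs from the reference summary.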