A new dataset for multilingual keyphrase generation

05 Jun 2022, 17:22 (modified: 06 Oct 2022, 19:29) NeurIPS 2022 Datasets and Benchmarks Readers: Everyone
Keywords: Keyphrase generation, multilingual keyphrase generation, dataset, keyphrases
Abstract: Keyphrases are an important tool for efficiently dealing with the ever-increasing amount of information present on the internet. While there are many recent papers on English keyphrase generation, keyphrase generation for other languages remains vastly understudied, mostly due to the absence of datasets. To address this, we present a novel dataset called Papyrus, composed of 16427 pairs of abstracts and keyphrases. We release four versions of this dataset, corresponding to different subtasks. Papyrus-e considers only English keyphrases, Papyrus-f considers French keyphrases, Papyrus-m considers keyphrase generation in any language (mostly French and English), and Papyrus-a considers keyphrase generation in several languages. We train a state-of-the-art model on all four tasks and show that they lead to better results for non-English languages, with an average improvement of 14.2% on keyphrase extraction and 2.0% on generation. We also show an improvement of 0.4% on extraction and 0.7% on generation over English state-of-the-art results by concatenating Papyrus-e with the Kp20K training set.
Supplementary Material: pdf
URL: https://github.com/smolPixel/French-keyphrase-generation
Dataset Url: https://github.com/smolPixel/French-keyphrase-generation/tree/main/data The dataset is available in the folders Papyrus (corresponding to Papyrus-a), Papyrus-e, Papyrus-f, and Papyrus-m, in tsv format. The uncurated data is available at https://github.com/smolPixel/French-keyphrase-generation/tree/main/Preprocessing, under the name dataset.jsonl.
License: We release the dataset and code under the Creative Commons public license.
Author Statement: Yes
Contribution Process Agreement: Yes
In Person Attendance: Yes
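The Dataset Url entry above states that the curated data is in tsv format and the uncurated data is in a file named dataset.jsonl. As a minimal sketch of how such files could be read, assuming UTF-8 text, tab-separated columns without quoting, and one JSON value per line (the page does not document the column layout or the JSON fields, and the helper names read_tsv and read_jsonl are hypothetical):

```python
import csv
import json


def read_tsv(path):
    """Return the rows of a tab-separated file as lists of strings.

    Assumption: fields are not quoted, so quote characters inside an
    abstract are kept as plain text.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Both helpers return plain Python lists, so the first few rows can be inspected to find out which columns or fields hold the abstracts and the keyphrases before building anything on top of them.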