Controlled and Balanced Dataset for Japanese Lexical Simplification

Tomonori Kodaira, Tomoyuki Kajiwara, Mamoru Komachi

2016 (modified: 13 Nov 2022), ACL (Student Research Workshop) 2016

Abstract: We propose a new dataset for evaluating Japanese lexical simplification methods. Previous datasets have several deficiencies: all of them substitute only a single target word, and some of them extract sentences only from a newswire corpus. In addition, most of these datasets do not allow ties, and they integrate the simplification rankings from all annotators without considering annotation quality. In contrast, our dataset has the following advantages: (1) it is the first controlled and balanced dataset for Japanese lexical simplification, with high correlation with human judgment, and (2) the consistency of the simplification ranking is improved by allowing ties among candidates and by taking the reliability of annotators into account.
