Open Peer Review. Open Publishing. Open Access. Open Discussion. Open Directory. Open Recommendations. Open API. Open Source.
DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension
Nov 03, 2017 (modified: Nov 03, 2017)ICLR 2018 Conference Blind Submissionreaders: everyoneShow Bibtex
Abstract:We propose DuoRC, a novel dataset for Reading Comprehension (RC) that motivates several new challenges for neural approaches in language understanding beyond those offered by existing RC datasets. DuoRC contains 186,089 unique question-answer pairs created from a collection of 7680 pairs of movie plots where each pair in the collection reflects two versions of the same movie - one from Wikipedia and the other from IMDb - written by two different authors. We asked crowdsourced workers to create questions from one version of the plot and a different set of workers to extract or synthesize corresponding answers from the other version. This unique characteristic of DuoRC where questions and answers are created from different versions of a document narrating the same underlying story, ensures by design, that there is very little lexical overlap between the questions created from one version and the segments containing the answer in the other version. Further, since the two versions have different level of plot detail, narration style, vocabulary, etc., answering questions from the second version requires deeper language understanding and incorporating background knowledge not available in the given text. Additionally, the narrative style of passages arising from movie plots (as opposed to typical descriptive passages in existing datasets) exhibits the need to perform complex reasoning over events across multiple sentences. Indeed, we observe that state-of-the-art neural RC models which have achieved near human performance on the SQuAD dataset, even when coupled with traditional NLP techniques to address the challenges presented in DuoRC exhibit very poor performance (F1 score of 37.42% on DuoRC v/s 86% on SQuAD dataset). This opens up several interesting research avenues wherein DuoRC could complement other Reading Comprehension style datasets to explore novel neural approaches for studying language understanding.
TL;DR:We propose DuoRC, a novel dataset for Reading Comprehension (RC) containing 186,089 human-generated QA pairs created from a collection of 7680 pairs of parallel movie plots and introduce a RC task of reading one version of the plot and answering questions created from the other version; thus by design, requiring complex reasoning and deeper language understanding to overcome the poor lexical overlap between the plot and the question.