- Abstract: The open-domain question answering (OpenQA) task aims to extract answers that match specific questions from a distantly supervised corpus. Unlike supervised reading comprehension (RC) datasets, where questions are designed for particular paragraphs, background sentences in OpenQA datasets are more prone to noise. We observe that most existing OpenQA approaches are vulnerable to noise because they simply regard every sentence that contains the answer span as ground truth and ignore the plausible correlation between the sentences and the question. To address this deficiency, we introduce a unified and collaborative model that leverages alignment information from query-sentence pairs in a small-scale supervised RC dataset and aggregates relevant evidence from a distantly supervised corpus to answer open-domain questions. We evaluate our model on several real-world OpenQA datasets, and experimental results show that our collaborative learning methods significantly outperform existing baselines.
- Keywords: natural language processing, open-domain question answering, semi-supervised learning
- TL;DR: We propose denoising strategies that leverage information from supervised RC datasets to handle noise in the open-domain QA task.