CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

Anonymous

08 Mar 2022 (modified: 05 May 2023)
NAACL 2022 Conference Blind Submission
Readers: Everyone
Paper Link: https://openreview.net/forum?id=4CwYXIpRYe0
Paper Type: Long paper (up to eight pages of content + unlimited references and appendices)
Abstract: We propose a novel open-domain question-answering dataset based on the Common Crawl project. With around 130 million multilingual question-answer pairs (including about 60 million English data points), a scale not previously available, we use this large, natural, diverse and high-quality corpus to in-domain pre-train popular language models for the task of question answering. In our experiments, we find that our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low-resource and fine-tuned settings across multiple tasks, models and benchmarks.
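As a rough illustration of what in-domain pre-training on such a question-answer corpus could look like, the minimal Python sketch below converts question-answer pairs into (source, target) sequence-to-sequence examples. The JSON-lines layout, the field names "question" and "answer", and the file name ccqa_sample.jsonl are assumptions made for this example, not the released CCQA format.

```python
import json

def qa_pairs_to_seq2seq(jsonl_path):
    """Yield (source, target) pre-training examples from a JSON-lines file of
    question-answer pairs. Field names and file layout are illustrative
    assumptions, not the released CCQA format."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            question = record.get("question", "").strip()
            answer = record.get("answer", "").strip()
            if question and answer:
                # Closed-book formulation: the model sees only the question
                # and learns to generate the answer text.
                yield f"question: {question}", answer

if __name__ == "__main__":
    for source, target in qa_pairs_to_seq2seq("ccqa_sample.jsonl"):
        print(source, "->", target)
```

In practice, such pairs would be tokenized and used as an intermediate pre-training step for a sequence-to-sequence or decoder-only language model before fine-tuning on a downstream question-answering benchmark, matching the zero-shot, low-resource and fine-tuned settings described in the abstract.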
Presentation Mode: This paper will be presented in person in Seattle
Copyright Consent Signature (type Name Or NA If Not Transferrable): Patrick Huber
Copyright Consent Name And Address: Meta, 1 Hacker Way, Menlo Park, CA, 94025