BigBio: A Framework for Data-Centric Biomedical Natural Language Processing

Published: 17 Sept 2022, Last Modified: 03 Jul 2024, NeurIPS 2022 Datasets and Benchmarks, Readers: Everyone
Keywords: biomedical, natural language processing, data-centric ai, language modeling
TL;DR: BigBio is a community library of 126+ biomedical NLP datasets, covering 13 tasks and 10 languages.
Abstract: Training and evaluating language models increasingly requires the construction of meta-datasets -- diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBio, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BigBio facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBio is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical
URL: https://github.com/bigscience-workshop/biomedical
Dataset Url: Instructions for installing and downloading BigBio are available on our project GitHub page: https://github.com/bigscience-workshop/biomedical
Author Statement: Yes
License: Apache 2.0
Supplementary Material: pdf
Contribution Process Agreement: Yes
In Person Attendance: Yes
Community Implementations: [14 code implementations](https://www.catalyzex.com/paper/bigbio-a-framework-for-data-centric/code)