Abstract: Relation extraction in the biomedical domain is challenging due to the lack of labeled data and the long-tail distribution of entity mentions. Recent work proposes distant supervision as a way to tackle the scarcity of annotated data by automatically pairing knowledge graph relationships with raw text. On several benchmarks, Distantly Supervised Biomedical Relation Extraction (Bio-DSRE) models produce very accurate results. Given the challenging nature of the task, however, we set out to investigate the validity of such impressive results. We probed the datasets used by \citet{amin2020data} and \citet{hogan2021abstractified} and found a significant overlap between training and evaluation relationships that, once resolved, reduced the accuracy of the models by up to 71\%. Furthermore, we noticed several inconsistencies in the data construction process, such as in the creation of negative samples and the improper handling of redundant relationships. To mitigate these issues, we present \meddistant, a new benchmark dataset obtained by aligning MEDLINE abstracts with the widely used SNOMED Clinical Terms (SNOMED-CT) knowledge base. We experimented with several state-of-the-art models following our methodology, showing that there is still plenty of room for improvement on the task. We release our code and data for reproducibility.
Paper Type: long