Extracting Medical Information Using Machine Reading


Nov 17, 2018 AKBC 2019 Conference Blind Submission readers: everyone
  • Keywords: Semantic Role Labelling, Information Extraction, Triple Extraction, Entity Linking, Triple Scoring
  • Abstract: A wealth of medical knowledge has been encoded using semantically rich languages like RDF and OWL, leading to medical Knowledge Bases (KBs) like SNOMED CT, NCI, FMA, and more. Nevertheless, medical information such as new treatments, drug-disease interactions, or relations between diseases, symptoms, and risk factors is initially published as unstructured text, and it takes considerable time and resources before it is incorporated into existing KBs. In this paper we present techniques we developed for extracting medical facts (triples) from unstructured sources. Our approach follows the Machine Reading paradigm, more precisely Semantic Role Labelling (SRL) based extraction, which is fully unsupervised. We show how we dealt with several deficiencies of SRL-based information extraction (IE): entity linking with large arguments, copula verbs that are treated as first-class relations, the inability to identify relations expressed through nouns, and the lack of scoring of extracted triples. Regarding scoring, we evaluate several methods based on existing off-the-shelf KB learning algorithms, and we also develop a custom classifier after some facts were validated by medical professionals as accepted or rejected. We applied our approach to unstructured sources and extracted about 120k triples. A random list of 5k triples was carefully validated by medical professionals, showing an encouraging acceptance rate for an unsupervised approach. Our set of triples was also compared with a manually constructed network of diseases, symptoms, and risk factors that is intended to be used for symptom-checking. The comparison showed that a large part of this network was extracted, highlighting the usefulness of our approach and the possibility of using our set of triples to further extend and revise this network.
  • Archival status: Archival
  • Subject areas: Natural Language Processing, Information Extraction, Knowledge Representation, Semantic Web