From free-text electronic health records to structured cohorts: Onconum, an innovative methodology for real-world data mining in breast cancer

Published: 01 Jan 2023, Last Modified: 06 Jun 2025Comput. Methods Programs Biomed. 2023EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Highlights•Real-world data is embedded in Electronic Health Records in an unstructured format (text) which is not directly exploitable for analysis.•The two main approaches to mine data are based on pre-existing dictionaries and on machine learning algorithms.•These methods have two main pitfalls: they are less suited for other languages than English or need a significant amount of manual annotation of text for pattern recognition, which is time-consuming.•We developed a software platform called Onconum, based on a hybrid method associating machine learning approaches and rule-based lexical methods, able to extract and structure information from free-text and unstructured electronic health records, independently of any pre-existing dictionary and of any language in a breast cancer context.•Based on multidisciplinarity between clinicians and data scientists, this methodology uses iterative entanglement of automated tasks and medical feedback, allowing implementation in any language or disease.•This methodology allows data mining in electronic health records not otherwise exploitable with a wide variety of investigable parameters and with sufficient quality.
Loading