Pretrained Medical Representations for the Practical Screening of Drug Repositioning Candidates

Yuhei Fujioka; Daitaro Misawa; Shingo Fukuma

Published: 30 May 2026, Last Modified: 11 Jun 2026

Venue: ICML2026-AI4Science Poster

License: CC BY 4.0

Additional Submission Instructions: For the camera-ready version, please include the author names and affiliations, funding disclosures, and acknowledgements.

Track: Track 1: Original Research/Position/Education/Attention Track

Keywords: Representation Learning, Healthcare, Drug Repositioning, Self-Supervised Learning, Hypothesis Generation

Abstract: Representation learning from medical code sequences in electronic health records and medical claims data has been successful in various clinical applications, such as disease prediction. However, significant challenges remain in extending this approach to the discovery of scientific hypotheses. One reason is that many existing BERT-based models fail to adequately capture the hierarchical structure of medical codes and the complex interactions between diagnoses and treatments. To address these limitations, we propose a unified pre-training framework that explicitly integrates hierarchical sub-token aggregation, partial masking, and cross-reference mechanisms. The proposed model consistently outperformed existing methods on both pre-training objectives and downstream clinical event prediction tasks, including dementia onset and hospitalization. We also conducted an $\textit{in silico}$ drug repositioning case study targeting Alzheimer’s disease. In the hypothesis generation step, our approach rediscovered known promising drugs (e.g., pitavastatin) in a data-driven manner, without relying on external knowledge sources such as the literature. In the hypothesis prioritization step, we introduced a Task-Adaptive Representation Approach to alleviate the over-encoding of historical prescription information within diagnostic vectors, enabling robust prioritization of the generated hypotheses. This study establishes an exploratory screening workflow for hypothesis generation and prioritization based on observational associations. Importantly, this framework is not intended to provide causal evidence, but rather to identify promising candidates for subsequent rigorous causal inference. Overall, this study demonstrates that domain-informed representation learning combined with task-adaptive representation control can enable a practical hypothesis discovery workflow applicable beyond a single application domain.

Submission Number: 43
