Learning Extremely Sparse Signals in High-Dimensional Cell-Free DNA Data Using Modern Hopfield Attention for Colorectal Cancer Detection

Published: 23 May 2026, Last Modified: 23 May 2026 | SD4H ICML 2026 | CC BY 4.0
Keywords: High-dimensional Health Data, Multi-resolution Signal, Early Cancer Detection, Deep Learning for Genomics, Attention Mechanisms, Multiple Instance Learning, Modern Hopfield Networks, Associative Memory, Interpretability, Liquid Biopsy, Cell-free DNA, DNA Methylation, Circulating Tumor DNA, Next-Generation Sequencing
TL;DR: We propose FLDL, an end-to-end supervised multiple instance learning framework for colorectal cancer detection from cfDNA that uses Modern Hopfield attention to detect rare tumor signals in sparse, high-dimensional, multi-resolution molecular data.
Abstract: Next-generation sequencing-based early detection of colorectal cancer from cell-free DNA (cfDNA) is a clinically important supervised learning problem on complex molecular data with extraordinarily sparse signal. It presents an extreme-scale multiple instance learning challenge: identifying rare tumor signals in high-dimensional, multi-resolution data with millions of instances per sample and witness rates that can fall below $0.0001\%$. We propose Fragment-Level Deep Learning (FLDL), an end-to-end deep learning framework that uses Modern Hopfield Networks to perform dense associative retrieval over the massive instance space. Using held-out real-world clinical and challenging contrived test sets, we compare FLDL's performance to a state-of-the-art machine learning model and to a deep learning model that uses max pooling instead of attention. Our results demonstrate that only the attention-based FLDL model outperforms the machine learning model, despite a modest training set size ($n = 4{,}394$). FLDL also scales effectively with sample size and with the number of instances per sample, and the interpretability of its attention weights and intermediate learned representations offers useful biological insights. This work establishes a new frontier for highly scalable, attention-based deep learning in the field of clinical cfDNA diagnostics.
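To make the contrast between attention-based and max pooling concrete, the following is a minimal sketch of Modern Hopfield-style attention pooling over a bag of instances, using the standard retrieval rule in which a query pattern is updated to a softmax-weighted sum of stored patterns. It is not the FLDL architecture: the toy bag, the hand-set query, the inverse temperature `beta`, and the helper names `hopfield_pool`, `max_pool`, and `make_bag` are all assumptions introduced for illustration, and the toy witness rate (2 in 2,000) is far higher than the rates described in the abstract.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_pool(instances, query, beta):
    """Pool a bag with one Hopfield-style retrieval step.

    weights = softmax(beta * <query, x_i>); pooled = sum_i weights_i * x_i.
    In a trained model the query would be learned; here it is hand-set.
    """
    scores = [beta * sum(q * x for q, x in zip(query, inst)) for inst in instances]
    weights = softmax(scores)
    dim = len(query)
    pooled = [sum(w * inst[d] for w, inst in zip(weights, instances)) for d in range(dim)]
    return pooled, weights


def max_pool(instances):
    """Attention-free baseline: elementwise maximum over the bag."""
    return [max(column) for column in zip(*instances)]


def make_bag(n_instances, n_witnesses, dim, rng):
    """Toy bag: Gaussian background plus a few 'witness' instances.

    The +3.0 shift on feature 0 is a hypothetical stand-in for a tumor signal.
    """
    bag = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_instances)]
    witness_idx = set(rng.sample(range(n_instances), n_witnesses))
    for i in witness_idx:
        bag[i][0] += 3.0
    return bag, witness_idx


rng = random.Random(0)
bag, witness_idx = make_bag(n_instances=2000, n_witnesses=2, dim=4, rng=rng)

query = [1.0, 0.0, 0.0, 0.0]  # assumed query aligned with the signal feature
pooled, weights = hopfield_pool(bag, query, beta=8.0)

# Share of attention assigned to the witness instances, and the top-weighted indices.
witness_mass = sum(weights[i] for i in witness_idx)
top = sorted(range(len(bag)), key=lambda i: weights[i], reverse=True)[:2]
baseline = max_pool(bag)
```

In this sketch the per-instance `weights` are what make the pooling inspectable: sorting by weight recovers which instances drove the bag-level representation, whereas `max_pool` returns only the pooled vector with no per-instance attribution.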
Submission Number: 109