One Leaf Knows Autumn: A Piece of Data-Model Facilitates Efficient Cancer Prognosis with Histological and Genomic Modalities

TMLR Paper3062 Authors

24 Jul 2024 (modified: 20 Nov 2024) · Under review for TMLR · CC BY 4.0
Abstract: The rapidly emerging field of computational pathology enables integrated image-omic solutions for cancer prognosis by jointly modeling histological and genomic data. However, current multi-modal techniques suffer from three major bottlenecks: (1) $\underline{Memory Overheads}$, since a raw histology image typically has an extremely high resolution, e.g., $203,183\times91,757$ pixels for cancer $\texttt{HNSC}$, and naive patch partitioning merely trades training time for memory. (2) $\underline{Massive Computing Costs}$, due to the immense parameter counts of recent state-of-the-art models, which demand substantial computational resources; moreover, the intrinsic representation redundancy of vanilla-trained networks leads to ineffective use of this capacity. (3) $\underline{Gradient Conflicts}$, because of significant heterogeneity between the image and genomic modalities, which causes their optimization directions to disagree. In this work, we propose an effective multi-modal pipeline for cancer prognosis, $\texttt{CancerMoE}$, to address these challenges. Specifically, from data to model, it $\underline{first}$ designs a dynamic patch selection algorithm that flexibly scores and locates informative patches online, trimming memory cost; it $\underline{then}$ introduces a Sparse Mixture-of-Experts (SMoE) framework to disentangle weight spaces and allocate the most relevant model pieces to each input sample, promoting training efficiency and synergistic optimization across modalities; $\underline{finally}$, it consolidates and sparsifies redundant attention heads, improving both efficiency and interpretability. Extensive experiments demonstrate that $\texttt{CancerMoE}$ achieves competitive performance on $\textbf{twelve}$ cancer datasets compared to previous methods, while requiring only $\textbf{1\%}$ of the image patches, $\textbf{20\%}$ of the model parameters, and $\textbf{30\%}$ of the merged attention heads of a vanilla transformer network. Key code is provided in the supplement.
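To make the first mechanism concrete: the abstract describes scoring whole-slide patches online and keeping only the most informative ones (roughly 1% of all patches). The authors' actual algorithm is in the supplement; the following is only a minimal PyTorch sketch under assumed interfaces, where `select_informative_patches` and the linear `scorer` are hypothetical names, not the paper's code.

```python
import torch

def select_informative_patches(patch_embeddings, scorer, k):
    """Score every patch online and keep only the top-k most informative ones.

    patch_embeddings: (N, D) tensor of patch features for one slide.
    scorer: a small network mapping (N, D) -> (N, 1) relevance scores
            (hypothetical; the paper's scoring module may differ).
    k: number of patches to retain (e.g., ~1% of N).
    """
    scores = scorer(patch_embeddings).squeeze(-1)         # (N,)
    topk = torch.topk(scores, k=min(k, scores.numel()))
    selected = patch_embeddings[topk.indices]             # (k, D)
    # Gate the surviving patches by their scores so the selection
    # step remains differentiable end to end (one possible choice).
    weights = torch.sigmoid(topk.values).unsqueeze(-1)    # (k, 1)
    return selected * weights, topk.indices

# Usage: keep ~1% of 10,000 patches with 768-dim features.
scorer = torch.nn.Linear(768, 1)
patches = torch.randn(10_000, 768)
kept, idx = select_informative_patches(patches, scorer, k=100)
```

For the second mechanism, the abstract describes a Sparse Mixture-of-Experts framework that routes each input to the most relevant model pieces. Below is a generic top-1-routed SMoE feed-forward layer, again a minimal illustrative sketch (the class name `SparseMoE` and the expert architecture are assumptions); the paper's expert design and routing may differ.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Minimal top-1 sparse mixture-of-experts feed-forward layer."""

    def __init__(self, dim, num_experts=8, hidden=2048):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (tokens, dim)
        logits = self.gate(x)                  # (tokens, num_experts)
        probs = torch.softmax(logits, dim=-1)
        top_p, top_e = probs.max(dim=-1)       # top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_e == e
            if mask.any():
                # Only the chosen expert runs per token; scaling by the
                # gate probability keeps the routing differentiable.
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Because each token activates a single expert, only a fraction of the layer's parameters is exercised per input, which is the sense in which such a design can reach a small fraction of the dense model's active parameter count.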
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hongsheng_Li3
Submission Number: 3062