DLAGF: Motion-Queried Cross-Attention Transformer Framework for Multimodal Cardiomyocyte Ageing Detection and Early Heart Failure Risk

Published: 24 Nov 2025, Last Modified: 24 Nov 2025
Venue: 5th Muslims in ML Workshop co-located with NeurIPS 2025
License: CC BY 4.0
Keywords: Cardiomyocyte ageing, iPSC-derived cardiomyocytes, Multi-modal fusion, Motion-queried cross-attention Transformer, RNA-seq gene expression, Optical flow motion analysis, DLAGF module, Early heart failure risk stratification
TL;DR: A compact motion-queried cross-attention Transformer fuses imaging, motion, functional metrics, and RNA-seq to detect early cardiomyocyte ageing, outperforming single-modality and generic multimodal baselines with minimal compute overhead.
Abstract: Cardiomyocyte ageing leads to heart failure, yet early detection is difficult with imaging alone. We introduce a simple, non-invasive multimodal model that combines visual and gene expression data to detect signs of ageing in heart cells. The model is a compact motion-queried cross-attention Transformer with a Dual-Level Attention-Gated Fusion (DLAGF) module that integrates four data types: motion from brightfield videos, morphology from single images, contraction metrics (CSV), and a reduced set of RNA-seq gene features. It was trained and evaluated on 672 clips from 28 wells across 3 plates, using splits grouped by well to prevent data leakage (train/val/test = 70/15/15; test set = 101 clips). The model achieves a macro F1 score of $0.861 \pm 0.011$, outperforming the motion-only model ($0.79 \pm 0.02$) by +7.4% in accuracy and +0.07 in macro F1. It also outperforms strong multimodal baselines, including Perceiver IO (0.84 macro F1) and a symmetric multimodal Transformer (0.85 macro F1), while adding very little computation (+0.15M parameters and +5% latency). The gains are statistically significant (paired bootstrap $\Delta\mathrm{F1} = 0.011$, $p = 0.004$; McNemar’s test $\chi^2 = 6.1$, $p = 0.013$), and per-class AUCs exceed $0.92$. In ablations, removing the gene features lowers macro F1 to 0.82. Attention-weight visualisations further show a clear link between motion changes and key gene features. The framework offers an efficient way to detect cardiomyocyte ageing early and is useful for drug testing and regenerative heart research. Because ageing phenotypes precede overt cardiac dysfunction, this multimodal readout supports early heart failure risk stratification in vitro.
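The abstract names the architecture but not its internals. The NumPy sketch that follows shows one plausible reading of motion-queried cross-attention: motion tokens supply the queries, while tokens from each other modality (image, contraction metrics, RNA-seq) supply the keys and values. The two gating levels (a softmax weighting over modalities, then a per-feature sigmoid gate that mixes the fused context back into the motion stream) are a hypothetical interpretation of DLAGF, not the authors' module. All function names, weight shapes, the parameter dictionary layout, and the single-head design are illustrative assumptions, and every modality is assumed to be already embedded to a common width d.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product attention.

    Queries come from the motion tokens; keys and values come from one other
    modality. Returns the attended context (T_motion, d) and the attention map.
    """
    q, k, v = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (T_motion, T_other)
    attn = softmax(scores, axis=-1)              # each row sums to 1
    return attn @ v, attn


def gated_fusion(motion, others, params):
    """HYPOTHETICAL two-level gated fusion (one possible reading of DLAGF).

    motion: (T, d) motion tokens; others: dict name -> (T_m, d) tokens.
    params["attn"][name] = (Wq, Wk, Wv), each (d, d);
    params["w_mod"]: (d,); params["W_gate"]: (2d, d).
    """
    ctx, maps = [], []
    for name, tokens in others.items():
        out, attn = cross_attention(motion, tokens, *params["attn"][name])
        ctx.append(out)
        maps.append(attn)
    ctx = np.stack(ctx)                                   # (M, T, d)

    # Level 1 (modality): one logit per modality from its mean-pooled context.
    mod_w = softmax(ctx.mean(axis=1) @ params["w_mod"])   # (M,), sums to 1
    fused = np.tensordot(mod_w, ctx, axes=1)              # (T, d)

    # Level 2 (token/feature): a sigmoid gate decides how much fused context
    # replaces the original motion features at each position and channel.
    gate_in = np.concatenate([motion, fused], axis=-1)    # (T, 2d)
    g = sigmoid(gate_in @ params["W_gate"])               # (T, d), in (0, 1)
    return g * fused + (1.0 - g) * motion, mod_w, maps
```

The returned per-modality attention maps are the kind of weights one could plot to relate motion frames to gene features, as the abstract describes; with the random weights a sketch like this would use, they carry no biological meaning.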
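Grouping by well matters because clips from the same well are strongly correlated, so a random clip-level split would let near-duplicate clips leak into the test set. A minimal sketch, assuming each clip is tagged with a well ID: shuffle the wells, then fill the test and validation splits to roughly 15% of clips each, whole wells at a time, and send the remaining wells to training. The function name, input format, and greedy fill rule are assumptions for illustration; because wells are indivisible, the resulting split sizes only approximate 70/15/15.

```python
import random
from collections import defaultdict


def grouped_split(clip_wells, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) clip-index lists with no well shared across splits.

    clip_wells[i] is the (hypothetical) well ID of clip i. Wells are shuffled,
    then assigned greedily: a well goes to test while test holds fewer than
    fractions[2] of all clips, otherwise to val while val holds fewer than
    fractions[1], otherwise to train.
    """
    by_well = defaultdict(list)
    for idx, well in enumerate(clip_wells):
        by_well[well].append(idx)

    wells = sorted(by_well)                 # fixed order before seeded shuffle
    random.Random(seed).shuffle(wells)

    n = len(clip_wells)
    targets = {"test": fractions[2] * n, "val": fractions[1] * n}
    splits = {"train": [], "val": [], "test": []}
    for well in wells:
        for name in ("test", "val"):
            if len(splits[name]) < targets[name]:
                splits[name].extend(by_well[well])
                break
        else:                               # both held-out splits are full
            splits["train"].extend(by_well[well])
    return splits["train"], splits["val"], splits["test"]
```

With 28 hypothetical wells of 24 clips each, the test and validation splits each receive five whole wells (120 clips) and training receives 432 clips, which illustrates why well-grouped splits cannot hit the target ratios exactly.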
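The significance tests in the abstract compare two models on the same test clips. A minimal plain-Python sketch of both: a paired bootstrap that resamples clips with replacement and scores both models on each resample, and McNemar's test on the clips where exactly one model is correct, with the 1-degree-of-freedom chi-square p-value computed as erfc(sqrt(χ²/2)). The one-sided bootstrap p-value (the fraction of resamples whose macro-F1 gain is at most 0), the continuity correction, and all function names are assumptions; the abstract does not state which conventions were used.

```python
import math
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no TP, FP or FN scores 0."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap(y_true, pred_a, pred_b, labels, n_boot=2000, seed=0):
    """Mean macro-F1 gain of model A over B and a one-sided p-value.

    Both models are scored on the same resampled clips in every replicate,
    which is what makes the bootstrap paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        f1_a = macro_f1(t, [pred_a[i] for i in idx], labels)
        f1_b = macro_f1(t, [pred_b[i] for i in idx], labels)
        deltas.append(f1_a - f1_b)
    p_value = sum(d <= 0 for d in deltas) / n_boot
    return sum(deltas) / n_boot, p_value


def mcnemar(y_true, pred_a, pred_b, correction=True):
    """McNemar chi-square on discordant clips, p-value from chi-square (1 dof)."""
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(pa == t and pb != t for t, pa, pb in triples)  # only A correct
    c = sum(pa != t and pb == t for t, pa, pb in triples)  # only B correct
    if b + c == 0:
        return 0.0, 1.0
    diff = max(abs(b - c) - (1 if correction else 0), 0)
    chi2 = diff ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))
```

For example, if model A is correct on 8 clips where model B is wrong and never the reverse, the corrected statistic is (8 - 1)²/8 = 6.125, giving p ≈ 0.013.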
Submission Number: 63