From Perfect AUC to Poor Transfer: Diagnosing Leakage in Cross-Platform Gene Signature Learning

05 Feb 2026 (modified: 02 Mar 2026) · Submitted to Sci4DL 2026 · CC BY 4.0

Keywords: cross-platform transcriptomics, microarray-to-RNA-seq transfer, feature selection, distribution shift, multi-objective optimization, NSGA-II, Maximum Mean Discrepancy (MMD), stability selection, Kuncheva index, TCGA-BRCA, leak-free evaluation

Abstract: Deep models often fail under distribution shift, yet the role of feature selection in amplifying or mitigating shift is underexplored. We study this in a stringent setting: transferring a tumour-vs-normal classifier across measurement platforms (Agilent microarray $\rightarrow$ RNA-Seq) using the same patients and genes. We introduce SCOPES, a leak-free, multi-objective selection framework that optimizes three competing goals: (i) predictive performance (AUC) via patient-safe cross-validation, (ii) selection stability (Kuncheva index), and (iii) cross-platform alignment (Maximum Mean Discrepancy, MMD). On matched TCGA-BRCA Agilent/RNA-Seq data, a label-informed $F$-score slab produced an implausibly perfect source model ($\mathrm{AUC} \approx 1.0$) but lost $\sim 0.30$ AUC after transfer, revealing selection leakage plus platform shift. Replacing the slab with an unsupervised MAD prefilter makes the trade-off explicit on the Pareto front: a one-gene, alignment-first solution achieves modest AUC with small transfer loss ($0.69 \rightarrow 0.61$, $\Delta \mathrm{AUC} \approx -0.08$), while a 30-gene, accuracy-first solution reaches near-perfect source AUC but transfers poorly ($\Delta \mathrm{AUC} \approx -0.38$). SCOPES provides a simple protocol to measure and control this trade-off (report source/target AUC, $\Delta \mathrm{AUC}$, and MMD), encouraging selections near a Pareto "knee" for portability. Finally, in the reverse direction (RNA-Seq $\rightarrow$ microarray), a 37-gene SCOPES signature attains $\mathrm{AUC}_{\mathrm{RNA}} = 0.654$ (CV) and $\mathrm{AUC}_{\mathrm{Agilent}} = 0.890$ ($\Delta \mathrm{AUC} = +0.236$), indicating directional shift. We argue that treating selection as a multi-objective design problem is a useful lens for the science of deep learning under shift.
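The reporting quantities named in the abstract (Kuncheva stability, MMD, and $\Delta \mathrm{AUC}$) can be illustrated with a minimal Python sketch. This is not the SCOPES implementation. The function names, the RBF kernel, the fixed bandwidth `gamma`, and the biased MMD estimator are all assumptions chosen for illustration, since the abstract does not specify them. $\Delta \mathrm{AUC}$ is taken as target minus source, which matches the signs reported in the abstract.

```python
import itertools
import numpy as np


def kuncheva_index(a, b, n_features):
    """Kuncheva consistency index for two equal-size feature subsets.

    With k = |a| = |b|, r = |a & b| and n = n_features:
        KI = (r * n - k^2) / (k * (n - k))
    1.0 means identical subsets; values near 0 mean chance-level overlap.
    """
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k or not 0 < k < n_features:
        raise ValueError("subsets must share a size k with 0 < k < n_features")
    r = len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))


def mean_kuncheva(subsets, n_features):
    """Average pairwise Kuncheva index over subsets (e.g. one per CV fold)."""
    pairs = itertools.combinations(subsets, 2)
    return float(np.mean([kuncheva_index(a, b, n_features) for a, b in pairs]))


def rbf_mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD with an RBF kernel.

    x, y: arrays of shape (n_samples, n_genes), one per platform.
    gamma is a hypothetical fixed bandwidth; the abstract does not give one.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def kernel(u, v):
        # Pairwise squared Euclidean distances, then the RBF kernel.
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * d2)

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())


def delta_auc(auc_source, auc_target):
    """Transfer change in AUC (target minus source); negative means a loss."""
    return auc_target - auc_source
```

Applied to the figures in the abstract, `delta_auc(0.69, 0.61)` gives about -0.08 for the one-gene solution and `delta_auc(0.654, 0.890)` gives about +0.236 for the reverse direction.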

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.

Style Files: I have used the style files.

Submission Number: 120
