Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: long paper (4–8 pages excluding references)
Keywords: single-cell transcriptomic data; large-scale pretraining; JEPA
Abstract: Learning robust representations from large-scale single-cell transcriptomic data is essential for understanding cellular heterogeneity, yet most self-supervised approaches operate in only one of two spaces. Reconstruction-based methods effectively denoise gene expression but impose no constraints on the embedding space, while contrastive methods organize embeddings but do not explicitly denoise inputs. Here, we introduce scJEPA, a dual-space self-supervised framework: denoising reconstruction captures global structure in expression space, while cross-view latent prediction organizes the embedding space by enforcing that masked views share consistent representations. This combination retains only predictable biological signal while discarding view-specific noise such as dropout and batch effects. Crucially, each scJEPA objective operates in a distinct space with a specialized role: denoising guarantees information preservation, while latent prediction determines which information is retained. We systematically evaluate scJEPA against reconstruction-based and contrastive methods under large-scale pretraining settings. Across diverse tasks, including zero-shot cell-type retrieval, classification on held-out datasets, cross-batch integration, and transfer to spatial transcriptomics, scJEPA consistently outperforms single-space objectives. Our results demonstrate that jointly learning in both the data and embedding spaces yields representations that better capture cellular properties.
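The abstract's dual-space objective can be sketched in miniature: one loss denoises a masked view back to the clean expression profile (data space), and a second loss predicts one masked view's latent from another's (embedding space). Everything below is a toy illustration, not the paper's method: the linear encoder/decoder/predictor maps, the zero-masking view generator, the Poisson toy counts, and the unit loss weighting are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in linear maps; the paper does not specify architectures here.
n_genes, n_latent = 50, 8
W_enc = rng.normal(size=(n_genes, n_latent)) / np.sqrt(n_genes)   # encoder
W_dec = rng.normal(size=(n_latent, n_genes)) / np.sqrt(n_latent)  # decoder
W_pred = rng.normal(size=(n_latent, n_latent)) / np.sqrt(n_latent)  # latent predictor

# Toy "clean" gene-expression profile for one cell (Poisson counts as a stand-in).
x = rng.poisson(2.0, size=n_genes).astype(float)

def masked_view(x, frac, rng):
    """Create a view by zero-masking a random fraction of genes (dropout-like noise)."""
    mask = rng.random(x.shape) < frac
    return np.where(mask, 0.0, x)

v1, v2 = masked_view(x, 0.3, rng), masked_view(x, 0.3, rng)
z1, z2 = v1 @ W_enc, v2 @ W_enc

# Data-space objective: denoise view 1 back to the clean profile.
loss_recon = np.mean((z1 @ W_dec - x) ** 2)

# Embedding-space objective: predict view 2's latent from view 1's latent.
loss_latent = np.mean((z1 @ W_pred - z2) ** 2)

# Dual-space total (relative weighting omitted in this sketch).
loss = loss_recon + loss_latent
```

The point of the sketch is only the split of roles: `loss_recon` anchors the representation to the full expression profile, while `loss_latent` asks the embedding to carry only what is predictable across masked views.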
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 63