Adapting Pretrained Vision Transformers from 2D to 3D for Cryo-ET Classification

Yuzhou Wang; Xingjian Li; Min Xu

Published: 09 Oct 2025, Last Modified: 01 Nov 2025

Venue: NeurIPS 2025 Workshop Imageomics

License: CC BY 4.0

Submission Track: Short papers presenting ongoing research or work submitted to other venues (up to 5 pages, excluding references)

Keywords: Cryo-ET, subtomogram classification, transfer learning, 2D-to-3D adaptation, vision transformers, computational biology

TL;DR: We adapt 2D image-pretrained Vision Transformers to 3D subtomograms and achieve state-of-the-art performance on both simulated and real Cryo-ET classification datasets.

Abstract: Cryogenic electron tomography (Cryo-ET) enables visualization of macromolecular structures in near-native environments, but the resulting subtomograms are noisy and difficult to classify with deep learning models. Although transfer learning has been attempted for Cryo-ET subtomogram tasks, the use of large-scale image-pretrained Transformers remains largely unexplored. In this work, we study how such Transformers can be adapted to Cryo-ET subtomogram classification. We propose a simple, effective framework that (i) adapts 2D pretrained Transformer weights to 3D subtomograms via weight inflation, and (ii) denoises subtomograms with Difference of Gaussians (DoG) filtering. On both simulated and real subtomogram datasets, our approach enables ViT-B and Swin-B to outperform randomly initialized Transformers and strong video-pretrained baselines: the weight-inflated Swin-B achieves 90.70% (simulated) and 99.29% (real), while the weight-inflated ViT-B reaches 85.40% and 97.32%, respectively. These results demonstrate that carefully adapted image-pretrained Transformers provide a strong and practical solution for Cryo-ET subtomogram classification.
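The abstract names the two components of the framework but does not spell out how they are implemented. The following is a minimal NumPy sketch of what they could look like, not the authors' code. It assumes the common "repeat and rescale" form of weight inflation applied to a patch-embedding kernel, and a DoG filter built from two separable Gaussian blurs. The function names, kernel shapes, and sigma values are illustrative assumptions.

```python
import numpy as np


def inflate_patch_embedding(w2d: np.ndarray, depth: int) -> np.ndarray:
    """Inflate a 2D kernel (out, in, kh, kw) to 3D (out, in, depth, kh, kw).

    Assumed scheme: repeat the 2D kernel along a new depth axis and divide by
    `depth`, so a volume that is constant along depth produces the same
    response as the 2D kernel applied to a single slice.
    """
    w3d = np.repeat(w2d[:, :, None, :, :], depth, axis=2)
    return w3d / depth


def gaussian_kernel_1d(sigma: float) -> np.ndarray:
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def gaussian_blur_3d(vol: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of a 3D volume with reflect padding."""
    k = gaussian_kernel_1d(sigma)
    radius = len(k) // 2
    out = vol.astype(np.float64)
    for axis in range(3):
        pad = [(0, 0)] * 3
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="reflect")
        # "valid" convolution on the padded line restores the original length.
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out


def difference_of_gaussians(vol: np.ndarray,
                            sigma_small: float = 1.0,
                            sigma_large: float = 2.0) -> np.ndarray:
    """Band-pass a volume: fine blur minus coarse blur (sigmas are assumptions)."""
    return gaussian_blur_3d(vol, sigma_small) - gaussian_blur_3d(vol, sigma_large)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w2d = rng.standard_normal((8, 1, 4, 4))   # hypothetical patch-embedding kernel
    w3d = inflate_patch_embedding(w2d, depth=4)
    print(w3d.shape)                          # (8, 1, 4, 4, 4)

    subtomogram = rng.standard_normal((16, 16, 16))  # stand-in for a noisy subtomogram
    filtered = difference_of_gaussians(subtomogram)
    print(filtered.shape)                     # (16, 16, 16)
```

Dividing by the depth keeps the scale of the inflated kernel's output close to that of the original 2D kernel, which is the usual motivation for this form of inflation. The DoG step acts as a band-pass filter: the coarse blur removes low-frequency background and the fine blur suppresses high-frequency noise.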

Submission Number: 66
