Keywords: CT, Foundation Models, Self-Supervised Learning, 3D DINOv2
TL;DR: We introduce task-agnostic 3D CT foundation models that deliver strong frozen-feature performance and require minimal fine-tuning.
Abstract: Existing foundation models (FMs) in the medical domain often require extensive fine-tuning or the training of resource-intensive decoders, while other encoders are pretrained with objectives biased toward specific tasks. This highlights the need for a strong foundation model baseline that requires minimal fine-tuning beyond feature extraction. In this work, we present task-agnostic pretraining of CT foundation models (TAP-CT): a simple yet effective adaptation of Vision Transformers (ViTs) and DINOv2 to volumetric data, enabling scalable self-supervised pretraining directly on 3D CT volumes. Our approach introduces a set of targeted adaptations to patch embeddings, positional encodings, and volumetric augmentations that make the model depth-aware while preserving the simplicity of the underlying architectures. We then demonstrate that large-scale 3D pretraining on an extensive in-house CT dataset (105K volumes) enables strong generalization through stable and robust frozen feature representations. We will release all pretrained models, experimental configurations, and downstream benchmark code (https://huggingface.co/fomofo/tap-ct-b-3d) to promote transparency and reproducibility, and to establish a powerful, low-resource baseline for future research in medical imaging.
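To give a concrete sense of the kind of 2D-to-3D adaptation the abstract refers to, the sketch below shows a depth-aware patch embedding with a learnable positional embedding over the volumetric token grid. It is a minimal illustration under assumed volume and patch sizes, not the released TAP-CT implementation; all names and dimensions are hypothetical.

```python
# Minimal sketch: 3D patch embedding for a ViT-style encoder on CT volumes.
# Illustrative only; sizes and module names are assumptions, not TAP-CT's code.
import torch
import torch.nn as nn


class PatchEmbed3D(nn.Module):
    """Split a CT volume into non-overlapping 3D patches and project them to tokens."""

    def __init__(self, vol_size=(64, 224, 224), patch_size=(8, 16, 16),
                 in_chans=1, embed_dim=768):
        super().__init__()
        self.grid_size = tuple(v // p for v, p in zip(vol_size, patch_size))
        num_patches = self.grid_size[0] * self.grid_size[1] * self.grid_size[2]
        # A 3D convolution with stride equal to the kernel size acts as the patch projector.
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable positional embedding over the (depth, height, width) token grid,
        # which is what makes the token sequence depth-aware.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 1, D, H, W) CT volume
        x = self.proj(x)                     # (B, C, D', H', W')
        x = x.flatten(2).transpose(1, 2)     # (B, D'*H'*W', C) token sequence
        return x + self.pos_embed


# Usage: a batch of two single-channel volumes yields 8*14*14 = 1568 tokens each.
tokens = PatchEmbed3D()(torch.randn(2, 1, 64, 224, 224))
print(tokens.shape)  # torch.Size([2, 1568, 768])
```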
Primary Subject Area: Foundation Models
Secondary Subject Area: Application: Radiology
Registration Requirement: Yes
Reproducibility: https://huggingface.co/fomofo/tap-ct-b-3d
Visa & Travel: No
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 59