\section{Methods}
\label{sec:methods}
We now present our approach for training generalizable autoencoders for the medical image domain. Autoencoding methods encode high-resolution images as downsized latent representations. Given a 2D input image of dimensions $H \times W$ with $B$ channels, an autoencoding method outputs a downsized latent representation of size $(H/\sqrt{f}) \times (W/\sqrt{f}) \times C$. Here, $f$ represents the downsizing factor applied to the 2D area of the image and $C$ represents a pre-specified number of latent channels. 3D autoencoding methods follow a similar formulation, where the input is a 3D image of dimensions $H \times W \times S$ with $B$ channels. In this case, the downsizing factor $f$ is applied to the 3D volume of the image; as a result, the latent representation has dimensions $(H/\sqrt[3]{f}) \times (W/\sqrt[3]{f}) \times (S/\sqrt[3]{f}) \times C$. Autoencoding methods can also decode latent representations back into reconstructed high-resolution images.
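
As a worked illustration of this formulation, consider the following input sizes and downsizing factors, which are chosen here purely for arithmetic convenience and are not taken from our experiments. A 2D image with $H = W = 1024$ and $f = 16$ has $\sqrt{f} = 4$, and a 3D image with $H = W = 256$, $S = 128$, and $f = 64$ has $\sqrt[3]{f} = 4$, so the latent representations have sizes
\[
\frac{1024}{4} \times \frac{1024}{4} \times C = 256 \times 256 \times C
\qquad \mathrm{and} \qquad
\frac{256}{4} \times \frac{256}{4} \times \frac{128}{4} \times C = 64 \times 64 \times 32 \times C .
\]
In both cases each spatial axis shrinks by the same factor (4), while the total number of spatial positions shrinks by $f$.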

We aim to develop large-scale, generalizable medical image autoencoders capable of preserving diverse clinically relevant features in both latent representations and reconstructions. To this end, we first collect a large-scale training dataset of 1,021,356 2D images and 31,374 3D images curated from 19 multi-institutional, open-source datasets \cite{johnson2019mimic,feng2021candid,jeong2022emory,sorkhei2021csaw,rsnamammo,nguyen2022vindrmammo,moreira2012inbreast,cai2023online,jack2008alzheimer,dagley2017harvard,insel2020a4,lamontagne2019oasis,bien2018deep,hooper2021impact,chilamkurthy2018development,wasserthal2023totalsegmentator,ji2022amos,armato2011lung,stanfordaimi_coca_2024}. Images are obtained from two chest X-ray datasets, six full-field digital mammogram (FFDM) datasets, four T1- and T2-weighted head magnetic resonance imaging (MRI) datasets, one knee MRI dataset, two head/neck CT datasets, two whole-body CT datasets, and two chest CT datasets.



We use this dataset to train a family of generalizable autoencoders for medical images. Motivated by prior work on natural images~\cite{rombach2022high}, we adopt variational autoencoders (VAEs) as the model backbone. We train the models with a novel two-stage scheme designed to optimize the quality of both latent representations and decoded reconstructions. The first stage trains base autoencoders on 2D images (Fig.~\ref{fig:method}a); we maximize the perceptual similarity between input images and reconstructed images using a perceptual loss~\cite{lpips}, a patch-based adversarial objective~\cite{isola2018patchgan}, and a domain-specific embedding consistency loss. Whereas existing works on autoencoders train using only this stage, the medical image domain introduces the added complexity of subtle, fine-grained features required for clinical interpretation; thus, we introduce a second stage of training, which further refines the latent space such that clinically relevant features are preserved across modalities (Fig.~\ref{fig:method}b). For 2D modalities (e.g., X-ray, FFDM), the second training stage leverages the embedding space of BiomedCLIP, a recently developed medical foundation model~\cite{zhang2023biomedclip}, to enforce feature consistency between input images and latent representations. For 3D modalities (e.g., CT, MRI), the second training stage lifts the 2D autoencoder architecture to 3D and continues fine-tuning with 3D images. In total, the MedVAE family includes four 2D autoencoders and two 3D autoencoders trained with various downsizing factors $f$ and latent channels $C$. Extended methods and implementation details are provided in Appendix~\ref{appendix:methods}.
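
As a schematic summary only, the first-stage objective described above can be read as a weighted combination of its three named terms. The symbols $\mathcal{L}_{\mathrm{perc}}$, $\mathcal{L}_{\mathrm{adv}}$, $\mathcal{L}_{\mathrm{emb}}$ and the weights $\lambda_{\mathrm{perc}}$, $\lambda_{\mathrm{adv}}$, $\lambda_{\mathrm{emb}}$ are notation introduced here for illustration; the additive form is an assumption, and any further terms (e.g., a standard VAE regularizer) are omitted:
\[
\mathcal{L}_{\mathrm{stage\,1}} \approx \lambda_{\mathrm{perc}}\, \mathcal{L}_{\mathrm{perc}} + \lambda_{\mathrm{adv}}\, \mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{emb}}\, \mathcal{L}_{\mathrm{emb}} ,
\]
where $\mathcal{L}_{\mathrm{perc}}$ denotes the perceptual loss, $\mathcal{L}_{\mathrm{adv}}$ the patch-based adversarial objective, and $\mathcal{L}_{\mathrm{emb}}$ the domain-specific embedding consistency loss.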