\section{Extended Methods}
\label{appendix:methods}

\subsection{Background}
In this section, we provide background information on autoencoders. 

2D autoencoding methods can be formulated as follows. We begin with a training dataset $\mathcal{D} = \{x_i\}_{i=1}^N$ consisting of $N$ high-resolution input images $x_i \in \mathcal{X}$. Each high-resolution image $x_i$ has dimensions $H \times W$ with $B$ channels, which can be expressed as $x_i \in \mathbb{R}^{H \times W \times B}$. An autoencoding method learns an encoding function $g: \mathcal{X} \rightarrow \mathcal{Z}$, where $\mathcal{Z}$ represents a low-dimensional latent space and $z_i \in \mathcal{Z}$ represents the downsized latent representation corresponding to the input $x_i$. Let $f$ represent the downsizing factor applied to the 2D area of the image; then, the latent representation $z_i$ can be expressed as $z_i \in \mathbb{R}^{(H/\sqrt{f}) \times (W/\sqrt{f}) \times C}$, where $C$ is a pre-specified number of latent channels. Autoencoding methods also learn a decoding function $h: \mathcal{Z} \rightarrow \hat{\mathcal{X}}$, which reconstructs the image $\hat{x_i}$ from the latent representation $z_i$. The encoding and decoding functions $g$ and $h$ are optimized in an end-to-end manner with the goal of maximizing perceptual similarity between $x_i$ and $\hat{x_i}$.

3D autoencoding methods follow a similar formulation, where each image $x_i$ represents a 3D volume with dimensions $H \times W \times S$ with $B$ channels. Here, the downsizing factor $f$ is applied to the 3D volume of the image; as a result, the latent representation $z_i$ can be expressed as $z_i \in \mathbb{R}^{(H/\sqrt[3]{f}) \times (W/\sqrt[3]{f}) \times (S/\sqrt[3]{f}) \times C}$, where $C$ is a pre-specified number of latent channels.
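
As a quick consistency check on this notation, the ratio between the number of values in the input and in the latent representation is the same in the 2D and 3D settings. The image size used in the example afterwards is hypothetical and chosen only for illustration:
\[
\frac{H \cdot W \cdot B}{(H/\sqrt{f}) \cdot (W/\sqrt{f}) \cdot C} = \frac{fB}{C},
\qquad
\frac{H \cdot W \cdot S \cdot B}{(H/\sqrt[3]{f}) \cdot (W/\sqrt[3]{f}) \cdot (S/\sqrt[3]{f}) \cdot C} = \frac{fB}{C}.
\]
For instance, a single-channel $1024 \times 1024$ image ($B = 1$) encoded with $f = 16$ and $C = 1$ would yield a latent representation of size $256 \times 256 \times 1$, which stores 16 times fewer values than the input.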

\subsection{Curating a large-scale training dataset}
We first collect a large-scale, open-source training dataset $\mathcal{D}$ for training medical image autoencoders. We incorporate diverse modalities and anatomical features in order to ensure that trained autoencoders gain proficiency in processing the wide variety of diagnostic features that occur in medical images. Our dataset consists of 1,021,356 2D images and 31,374 3D images obtained from 19 multi-institutional, open-source datasets.

2D images include chest X-rays and full-field digital mammograms (FFDMs), selected because (a) chest X-rays are well studied, with large amounts of publicly available data, and (b) FFDMs are a challenging class of images due to their large dimensions and the presence of fine-grained features critical for diagnosis (e.g., microcalcifications). We collect images from two chest X-ray datasets and six FFDM datasets \cite{johnson2019mimic,feng2021candid,jeong2022emory,sorkhei2021csaw,rsnamammo,nguyen2022vindrmammo,moreira2012inbreast,cai2023online}.

3D images include head MRIs, knee MRIs, and high-resolution whole-body (head, neck, abdomen, chest, lower limb) CTs. We selected these datasets because (a) head MRIs and CTs are commonly obtained examinations, and (b) high-resolution CTs tend to contain subtle features and consume large amounts of storage. These images were curated from four T1- and T2-weighted head MRI datasets (14,296), one knee MRI dataset (3,564), two head/neck CT datasets (10,156), two whole-body CT datasets (1,434), and two chest CT datasets (1,924) \cite{jack2008alzheimer,dagley2017harvard,insel2020a4,lamontagne2019oasis,bien2018deep,hooper2021impact,chilamkurthy2018development,wasserthal2023totalsegmentator,ji2022amos,armato2011lung,stanfordaimi_coca_2024}.

\subsection{Training autoencoders for medical images}
In this section, we discuss our two-stage approach for training generalizable autoencoders for medical images. Motivated by prior work on natural images \cite{rombach2022high}, we elect to use variational autoencoders (VAEs) as our backbone. In the first stage of training, we optimize for reconstruction quality by maximizing perceptual similarity between the input image $x$ and the reconstructed image $\hat{x}$. Whereas existing works train autoencoders solely using this approach, the medical image domain introduces the added complexity of subtle, fine-grained features required for clinical interpretation of images; thus, we introduce a second stage of training, where the latent representation space $\mathcal{Z}$ is refined with continued fine-tuning. Our approach is intended to explicitly preserve diverse clinically-relevant features in both latent representations and reconstructed images. In total, the Med-VAE family includes four 2D VAEs and two 3D VAEs trained with various downsizing factors. \\

\noindent \textbf{Stage 1: Training Base Autoencoders} (Fig.~\ref{fig:method}a). 
We begin by performing base training of the autoencoders using the collected 2D images in order to optimize the quality of reconstructions $\hat{x}$. In line with prior work \cite{rombach2022high}, each Med-VAE autoencoder learns an encoder and decoder (corresponding to functions $g$ and $h$) end-to-end using a fully convolutional VAE. Each Med-VAE autoencoder accepts single-channel, high-resolution medical images $x_i$ as input, applies function $g$ to transform the input to a downsized latent representation $z_i$, and then applies function $h$ to produce the reconstruction $\hat{x_i}$ of the original image. Med-VAE models are characterized by two hyperparameters: $f$, which represents the downsizing factor applied to the 2D area of the input image, and $C$, which describes the number of channels included in the latent representation. For instance, given an input image $x_i$ of size $H \times W \times 1$, a Med-VAE model with $f = 16$ and $C = 3$ would generate a latent representation $z_i$ of size $(H/4) \times (W/4) \times 3$, downsizing the image area by $16\times$ and adding two channels. The reconstructed image $\hat{x_i}$ would be of size $H \times W \times 1$.

In order to learn functions $g$ and $h$, the VAE is trained to maximize the similarity between $x_i$ and $\hat{x_i}$ using a perceptual loss term \cite{lpips} and a patch-based adversarial objective \cite{isola2018patchgan}. Additionally, in order to ensure preservation of clinically-relevant features within the reconstructed image, we introduce a domain-specific embedding consistency loss based on BiomedCLIP, a pretrained vision-language foundation model trained on a large corpus of paired medical image-text data \cite{zhang2023biomedclip}. During training, we apply an $L_2$ penalty between BiomedCLIP embeddings corresponding to the input image $x_i$ and the reconstructed image $\hat{x_i}$. This loss function is inspired by prior work on developing autoencoders for chest X-rays \cite{lee2023llmcxr}. Finally, in addition to the loss functions listed above, a KL-divergence penalty is applied to the latent sample in order to pull latents towards a standard normal distribution; the penalty is assigned a low weight of $10^{-6}$.
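
One way to summarize the Stage 1 objective described above is as a weighted sum of the four terms. The notation below is shorthand introduced only for this sketch: $b(\cdot)$ denotes the BiomedCLIP image embedding function, $q(z \mid x_i)$ denotes the approximate posterior produced by the encoder, and $\lambda_{\mathrm{adv}}$ and $\lambda_{\mathrm{emb}}$ are placeholder weights whose values are not specified here. Only $\lambda_{\mathrm{KL}} = 10^{-6}$ is stated above, and writing the $L_2$ penalty in squared form is an assumption:
\[
\mathcal{L}_{\mathrm{Stage\,1}} =
\mathcal{L}_{\mathrm{perc}}(x_i, \hat{x_i})
+ \lambda_{\mathrm{adv}} \, \mathcal{L}_{\mathrm{adv}}(x_i, \hat{x_i})
+ \lambda_{\mathrm{emb}} \, \| b(x_i) - b(\hat{x_i}) \|_2^2
+ \lambda_{\mathrm{KL}} \, D_{\mathrm{KL}}\big( q(z \mid x_i) \,\|\, \mathcal{N}(0, I) \big).
\]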

We use the above loss functions and the curated dataset of one million 2D images to train the following four base autoencoders, which span various downsizing factors and numbers of latent channels. Implementation details for each base model are described below:
\begin{itemize}
\item \textbf{2D Base Autoencoder (Stage 1) with $f=16$ and $C=1$}: This autoencoder yields latent representations $z_i$ of size $(H/4) \times (W/4) \times 1$. Stage 1 training is performed from scratch. The VAE is trained solely with the perceptual loss, the KL-divergence penalty, and the BiomedCLIP embedding consistency loss for the first 3125 steps; then, the patch-based adversarial objective is applied. We train for 100K steps using 8 NVIDIA A100 GPUs and a batch size of 32. 
\item \textbf{2D Base Autoencoder (Stage 1) with $f=16$ and $C=3$}: This autoencoder yields latent representations $z_i$ of size $(H/4) \times (W/4) \times 3$. We first initialize the VAE with weights from a previously-developed natural image autoencoder (KL-VAE) \cite{rombach2022high}. Then, we perform Stage 1 training using LoRA \cite{hu2021lora} with rank=4 applied to all 2D convolutional layers. We train with all four loss functions for 50K steps using 8 NVIDIA A100 GPUs and a batch size of 32.
\item \textbf{2D Base Autoencoder (Stage 1)  with $f=64$ and $C=1$}: This autoencoder yields latent representations $z_i$ of size $(H/8) \times (W/8) \times 1$. Stage 1 training is performed from scratch. The VAE is trained solely with the perceptual loss, the KL-divergence penalty, and the BiomedCLIP embedding consistency loss for the first 3125 steps; then, the patch-based adversarial objective is applied. We train for 100K steps using 8 NVIDIA A100 GPUs and a batch size of 32. 
\item \textbf{2D Base Autoencoder (Stage 1) with $f=64$ and $C=4$}: This autoencoder yields latent representations $z_i$ of size $(H/8) \times (W/8) \times 4$. We first initialize the VAE with weights from a previously-developed natural image autoencoder (KL-VAE) \cite{rombach2022high}. Then, we perform Stage 1 training using LoRA \cite{hu2021lora} with rank=4 applied to all 2D convolutional layers. We train with all four loss functions for 50K steps using 8 NVIDIA A100 GPUs and a batch size of 32.
\end{itemize}
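
For the two models trained from scratch, the delayed adversarial objective can be viewed as a step schedule on the adversarial weight. In the sketch below, $t = 1, 2, \ldots$ indexes training steps, $\mathbf{1}[\cdot]$ is the indicator function, and $\bar{\lambda}_{\mathrm{adv}}$ is a placeholder for the (unspecified) adversarial weight once the term is switched on; all three symbols are introduced only for illustration:
\[
\lambda_{\mathrm{adv}}(t) = \bar{\lambda}_{\mathrm{adv}} \cdot \mathbf{1}[\, t > 3125 \,].
\]
If the batch size of 32 is interpreted as the global batch size (an assumption), the warm-up period covers $3125 \times 32 = 100{,}000$ training images before the adversarial term becomes active.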

\noindent \textbf{Stage 2: Preserving Clinically-Relevant Features Across Modalities} (Fig.~\ref{fig:method}b). After performing base training of the autoencoders using the collected 2D images, we introduce a second stage of training intended to further refine the latent space such that clinically-relevant features are preserved across various modalities. 

In the context of 2D imaging modalities, the second training stage takes the form of a lightweight fine-tuning procedure designed to maximize consistency in clinically-relevant features between the input image and the latent representation. Our key insight here is that image embeddings generated by BiomedCLIP~\cite{zhang2023biomedclip} can effectively capture clinically-relevant features in 2D medical images, suggesting utility as a guidance mechanism during training\footnote{We use the \texttt{BiomedCLIP-PubMedBERT\_256-vit\_base\_patch16\_224} model available on HuggingFace at \href{https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224}{https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT\_256-vit\_base\_patch16\_224}.}. We freeze all parameters in the encoder and decoder of the VAE. During training, the input image $x_i$ is passed through the frozen VAE encoder to generate the latent representation $z_i$; then, $z_i$ is passed through a series of lightweight, trainable projection layers, which yield an output representation $\Bar{z_i}$ with the same size as $z_i$. Let the function $b(\cdot)$ represent the BiomedCLIP embedding function. We optimize the projection layer weights using a domain-specific embedding consistency loss, which takes the form of an $L_2$ loss between $b(x_i)$ and $b(\Bar{z_i})$. All downstream evaluations of latent representation quality are performed with the projected latent $\Bar{z_i}$. We perform Stage 2 training using the curated 2D training dataset with one million images. Our procedure yields four 2D Med-VAE autoencoders with various downsizing factors and number of latent channels:
\begin{itemize}
\item \textbf{2D Med-VAE with $f=16$ and $C=1$}: The projection layers generate $\Bar{z_i}$ of size $(H/4) \times (W/4) \times 1$. Stage 2 training is performed for 50K steps using 8 NVIDIA A100 GPUs and a batch size of 32. 
\item \textbf{2D Med-VAE with $f=16$ and $C=3$}: The projection layers generate $\Bar{z_i}$ of size $(H/4) \times (W/4) \times 3$. Stage 2 training is performed for 50K steps using 8 NVIDIA A100 GPUs and a batch size of 32. 
\item \textbf{2D Med-VAE with $f=64$ and $C=1$}: The projection layers generate $\Bar{z_i}$ of size $(H/8) \times (W/8) \times 1$. Stage 2 training is performed for 60K steps using 8 NVIDIA A100 GPUs and a batch size of 32. 
\item \textbf{2D Med-VAE with $f=64$ and $C=4$}: The projection layers generate $\Bar{z_i}$ of size $(H/8) \times (W/8) \times 4$. Stage 2 training is performed for 50K steps using 8 NVIDIA A100 GPUs and a batch size of 32. 
\end{itemize}
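
The 2D Stage 2 objective described above can be written compactly as follows. Here $p_\theta$ is shorthand (not used elsewhere) for the trainable projection layers with parameters $\theta$, $g$ is the frozen encoder, and $b(\cdot)$ is the BiomedCLIP embedding function. This sketch does not capture how $\Bar{z_i}$ is prepared as input to BiomedCLIP (e.g., any resizing or channel handling), and writing the $L_2$ loss in squared form is an assumption:
\[
\Bar{z_i} = p_\theta\big( g(x_i) \big),
\qquad
\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \| b(x_i) - b(\Bar{z_i}) \|_2^2 .
\]
Because only $\theta$ is optimized, the reconstructions $\hat{x_i}$ produced by the frozen encoder and decoder are unchanged by this stage.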

In the context of 3D imaging modalities (e.g., CT scans, MRIs), the second training stage involves lifting the 2D VAE architecture to 3D using a kernel centering inflation strategy \cite{zhang2022adapting}; we then continue training with 3D images. We note here that using external 2D medical foundation models like BiomedCLIP to enforce feature consistency is inadequate for 3D settings. As a result, we instead implement a training procedure focused on maximizing perceptual similarity, analogous to 2D Stage 1 training. We train the 3D autoencoders using random cubic patches of size $64 \times 64 \times 64$. The perceptual loss and the patch-based adversarial objective are calculated per slice, with the final loss term computed as the mean across all slices in the volume. Following this training strategy, a 3D Med-VAE model with $f = 64$, $C = 1$, and input image $x_i$ of size $H \times W \times S \times 1$ would generate a latent representation $z_i$ of size $(H/4) \times (W/4) \times (S/4) \times 1$, downsizing the volume by $64\times$. We perform Stage 2 training using the curated dataset of 31,374 3D images. Our procedure yields two 3D Med-VAE autoencoders with different downsizing factors:
\begin{itemize}
\item \textbf{3D Med-VAE with $f=64$ and $C=1$}: The latent representations $z_i$ are of size $(H/4) \times (W/4) \times (S/4) \times 1$. We initialize the VAE with weights from 2D Base Autoencoder (Stage 1) with $f=16$ and $C=1$. We then train the VAE for 35K steps using 4 NVIDIA A6000 GPUs and a batch size of 32.
\item \textbf{3D Med-VAE with $f=512$ and $C=1$}: The latent representations $z_i$ are of size $(H/8) \times (W/8) \times (S/8) \times 1$. We initialize the VAE with weights from 2D Base Autoencoder (Stage 1) with $f=64$ and $C=1$. We then train the VAE for 140K steps using 1 NVIDIA A6000 GPU and a batch size of 8. Both 3D Med-VAEs therefore process the same number of training samples once batch size is taken into account ($35\mathrm{K} \times 32 = 140\mathrm{K} \times 8 = 1.12\mathrm{M}$).
\end{itemize}
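
Two steps of the 3D procedure above can be sketched in equations. First, under a centering inflation strategy, each 2D convolutional kernel $K^{\mathrm{2D}} \in \mathbb{R}^{k \times k}$ is placed in the central slice of a 3D kernel $K^{\mathrm{3D}} \in \mathbb{R}^{k \times k \times k}$, with zeros elsewhere. The index notation, the Kronecker delta $\delta$, and the assumption of an odd kernel size $k$ are introduced here only for illustration:
\[
K^{\mathrm{3D}}_{u,v,w} = K^{\mathrm{2D}}_{u,v} \, \delta_{w,c},
\qquad c = \frac{k+1}{2},
\qquad u, v, w \in \{1, \ldots, k\}.
\]
At initialization, each inflated kernel therefore responds only to the central slice of its receptive field, so the pretrained 2D filters are carried over unchanged. Second, the per-slice losses can be written as an average over the slices of a training patch, where $x_i^{(s)}$ and $\hat{x_i}^{(s)}$ denote the $s$-th slice of the input and reconstructed patch, $\mathcal{L}^{\mathrm{2D}}$ stands for either the perceptual loss or the patch-based adversarial objective, and $S' = 64$ for the cubic patches used here (the choice of slicing axis is not specified by this sketch):
\[
\mathcal{L}^{\mathrm{3D}}(x_i, \hat{x_i}) = \frac{1}{S'} \sum_{s=1}^{S'} \mathcal{L}^{\mathrm{2D}}\big( x_i^{(s)}, \hat{x_i}^{(s)} \big).
\]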


