\section{Introduction}
\label{sec:intro}
Medical images (e.g., X-rays and computed tomography (CT) scans) are essential diagnostic tools in clinical practice. Since medical conditions are often characterized by subtle features, images are generally acquired with high spatial resolution and large fields of view in order to capture the level of diagnostic detail that radiologists require for interpretation \cite{huda2015resolution}. However, high-resolution medical images, especially volumetric (3D) images, can incur large data storage costs and increased or even intractable computational complexity for downstream computer-aided diagnosis (CAD) models \cite{freire2022computational, tan2019efficientnet}. These costs are likely to become a significant concern in the near future due to the rapid growth of medical imaging volumes stored by hospitals \cite{mesterhazy2020high}, the expanding use of CAD tools in clinics \cite{engin2020,najjar2023}, and paradigm shifts towards large-scale foundation models \cite{bommasani2022opportunities,chen2024chexagentfoundationmodelchest,merlin}. Many existing CAD models address this challenge by interpolating images to lower resolutions, even though models trained on interpolated data perform worse~\cite{sabottke2020effect, huang2023self}.

% ******* Figure ********
\begin{figure}[t]
\captionsetup{format=plain}
\centering
\includegraphics[width=\textwidth]{figures/method.pdf}
\caption{We introduce MedVAE, a suite of large-scale autoencoders capable of downsizing medical images to latent representations and decoding latent representations back to images.}
\label{fig:method}
\end{figure}



A promising solution lies in powerful autoencoder methods, which are capable of encoding images as downsized latent representations and decoding latent representations back to images. Recent works, particularly in the context of latent diffusion models, have demonstrated that downsized latent representations can capture relevant spatial structure from high-resolution input images while simultaneously improving efficiency on tasks such as image generation \cite{rombach2022high}. These findings suggest that autoencoders may hold potential for addressing the aforementioned storage and efficiency challenges in the medical domain by encoding high-resolution images as downsized latent representations, which can be used to develop downstream CAD models at a fraction of the computational cost.

Several large-scale autoencoders have been introduced in recent years \cite{rombach2022high,lee2023llmcxr}; however, directly applying these models to the medical domain is challenging since medical images include a diverse range of clinically relevant features (e.g., tumors, lesions, fractures), anatomical regions of focus (e.g., head, chest, knee), and modalities (e.g., 2D and 3D images). An effective, generalizable autoencoding approach in the medical image domain must operate across a wide range of medical images and preserve clinically relevant features in both downsized latents and decoded reconstructions. However, existing autoencoder models are either (a) developed for natural images \cite{rombach2022high}, which represent a significant domain shift from medical images, or (b) developed for a focused set of medical images (e.g., chest X-rays) \cite{lee2023llmcxr} and not explicitly trained to preserve clinically relevant features across diverse medical images.

In this work, we address these limitations by introducing MedVAE, a family of six large-scale, generalizable 2D and 3D autoencoder models developed for the medical image domain. We first curate a large-scale training dataset with over one million 2D and 3D images, and we then train the models using a novel two-stage training scheme designed to optimize the quality of both latent representations and decoded reconstructions.

We evaluate the quality of latent representations (using 8 CAD tasks) and reconstructed images (using both automated and manual perceptual quality evaluations) with respect to the preservation of clinically relevant features. Evaluations are derived from 20 multi-institutional, open-source medical datasets with 4 imaging modalities (X-ray, full-field digital mammograms, CT, and MRI) and 8 anatomical regions. We measure the extent to which MedVAE latent representations and reconstructed images can contribute to downstream storage and efficiency benefits while simultaneously preserving clinically relevant features. Our results demonstrate that (1) downsized MedVAE latent representations can be used as drop-in replacements for high-resolution images in CAD pipelines while maintaining or exceeding performance; (2) downsized latent representations reduce storage requirements (up to 512$\times$) and improve downstream efficiency of CAD model training (up to 70$\times$ in model throughput) when compared to high-resolution input images; and (3) decoded reconstructions effectively preserve clinically relevant features, as verified by an expert reader study. Our results also demonstrate that MedVAE models outperform existing natural image autoencoders.
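As a rough illustration of where a storage reduction of this magnitude can come from, suppose that an autoencoder downsizes each of the $d$ spatial axes of an image by a factor $f$ and maps $C_{\mathrm{in}}$ input channels to $C_{\mathrm{lat}}$ latent channels. The symbols $f$, $d$, $C_{\mathrm{in}}$, and $C_{\mathrm{lat}}$ are notation assumed here for illustration only; they are not settings reported in this section. At equal numeric precision, the ratio of stored elements between the input image and its latent representation is
\[
R = \frac{f^{d}\, C_{\mathrm{in}}}{C_{\mathrm{lat}}},
\]
so a hypothetical single-channel 3D volume with $f = 8$ and $C_{\mathrm{lat}} = 1$ would give $R = 8^{3} = 512$, whereas the same per-axis factor applied to a 2D image would give $R = 8^{2} = 64$.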

Ultimately, we demonstrate the potential of large-scale, generalizable autoencoders to address the critical storage and efficiency challenges currently faced by the medical domain. Using MedVAE latent representations instead of high-resolution images in training pipelines can improve model efficiency while preserving clinically relevant features.
