\section{Extended Discussion}
\label{appendix:discussion}

High-resolution medical images incur large storage costs and impose high, sometimes intractable, computational costs on models trained on them. As the volume of data stored by hospitals continues to grow and large-scale foundation models become more commonplace, methods for inexpensively storing and efficiently processing high-resolution medical images are becoming critical. In this work, we address this need by introducing Med-VAE, a family of six large-scale autoencoders for medical images developed using a novel two-stage training procedure. Med-VAE encodes high-resolution medical images as downsized latent representations. We demonstrate with extensive evaluations that (1) downsized latent representations can effectively replace high-resolution images in CAD pipelines while maintaining or exceeding performance, (2) downsized latent representations reduce storage requirements (up to $512\times$) and improve downstream efficiency (up to $70\times$ in model throughput) when compared to high-resolution input images, and (3) reconstructed images effectively preserve relevant features necessary for clinical interpretation by radiologists.

Several prior works have introduced powerful autoencoders capable of generating downsized latents for images. In particular, recent work on latent diffusion models has involved the development of several large-scale autoencoders, such as VQ-GANs and VAEs, trained on eight million natural images \cite{rombach2022high,kingma2013vae,esser2021taming,openimages}; downsized latents generated by these models were shown to capture relevant spatial structure and to improve the efficiency of downstream diffusion model training \cite{rombach2022high}. However, recent works have demonstrated that models trained on natural images often generalize poorly to medical images due to significant distribution shift \cite{guan2022,van2023exploring,chambon2022adapting}, suggesting that existing natural-image autoencoders may not be well suited to the complexity of the medical image domain. Our evaluations on both latent representations and reconstructed images support this point: existing large-scale natural-image autoencoders consistently underperform our domain-specific medical image autoencoders. These findings underscore the need for domain-specific models capable of capturing complex and fine-grained patterns across diverse imaging modalities and anatomical regions.

Our work aims to reduce computational costs associated with automated medical image interpretation by proposing the use of training datasets composed of downsized Med-VAE latent representations rather than high-resolution medical images. For instance, given a chest X-ray training dataset with images of size $1024 \times 1024$ with 1 channel, our 2D Med-VAE model with $f=64$ and $C=1$ can generate downsized latent representations of size $128 \times 128$ with 1 channel (an $8\times$ reduction along each spatial dimension), yielding substantial downstream efficiency and storage benefits. We demonstrate with eight CAD tasks that latent representations do not result in the loss of clinically important information; at a 2D downsizing factor of $f=16$ and a 3D downsizing factor of $f=64$, we observe equivalent or better performance than high-resolution images, with substantial improvements over multiple existing downsizing methods. Med-VAE models can also generalize beyond the images included in the training set, as shown by performance on 2D musculoskeletal X-rays and 3D spine CTs. Importantly, the efficiency benefits of using latent representations are significant; in particular, latent representations permit much larger batch sizes, which is especially useful in the modern era of self-supervised foundation models that rely heavily on large batch sizes during training.
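
As a minimal worked sketch of the size arithmetic in the example above, we assume (as the $1024 \rightarrow 128$ example implies, rather than as a general definition) that the 2D downsizing factor $f$ counts the total reduction in spatial area and is applied equally to both sides, so that each side shrinks by $\sqrt{f}$. Under that assumption, an $H \times W$ input encoded with $C$ latent channels has latent size
\[
\frac{H}{\sqrt{f}} \times \frac{W}{\sqrt{f}} \times C
\;=\; \frac{1024}{\sqrt{64}} \times \frac{1024}{\sqrt{64}} \times 1
\;=\; 128 \times 128 \times 1 ,
\]
and the number of stored values falls from $1024^2 = 1{,}048{,}576$ to $128^2 = 16{,}384$, a ratio of $64 = f$ when the latent and input channel counts are both 1.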

The Med-VAE autoencoder family includes two 3D autoencoders that are explicitly designed to downsize 3D medical imaging modalities (e.g., CT and MRI), a previously understudied setting. Our results demonstrate that at a 3D downsizing factor of $f=64$, the volumetric latent representations generated by 3D Med-VAE are of substantially higher quality than those generated by stitching together 2D slices downsized using 2D baselines. This suggests that 3D autoencoders can better capture clinically important volumetric patterns, such as fractures that span multiple slices. Efficiency benefits in the 3D setting are also notable, particularly since training downstream CAD models on high-resolution 3D volumes is often computationally expensive or intractable. At significantly higher downsizing factors ($f=512$), we observe the benefits of 3D autoencoder training to be less pronounced, suggesting that users will need to carefully weigh the tradeoff between latent representation quality and desired downstream efficiency when selecting a Med-VAE model.
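
To make this tradeoff concrete, the following sketch assumes that the 3D downsizing factor $f$ counts the total reduction in voxel count and is applied equally along all three axes, so that each axis shrinks by $\sqrt[3]{f}$; both the isotropy assumption and the $256 \times 256 \times 256$ input volume are hypothetical and chosen purely for illustration. Under these assumptions,
\[
f = 64:\;\; \frac{256}{\sqrt[3]{64}} = \frac{256}{4} = 64 \ \mbox{per axis},
\qquad
f = 512:\;\; \frac{256}{\sqrt[3]{512}} = \frac{256}{8} = 32 \ \mbox{per axis},
\]
so the latent volume would be $64 \times 64 \times 64$ at $f=64$ and $32 \times 32 \times 32$ at $f=512$. Moving from $f=64$ to $f=512$ therefore leaves one eighth as many latent values to represent the same volume.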

In addition to generating high-quality latent representations, Med-VAE models also include a trained decoder, which can reconstruct the original high-resolution image from the downsized latent. This capability is particularly useful in the medical imaging domain, since high-resolution images are necessary for effective clinical interpretation by radiologists. We demonstrate with a reader study involving three radiologists that reconstructed images can effectively preserve the clinically relevant signal needed for diagnosis; in this setting, fine-grained fractures in chest X-rays were preserved through the encoding and decoding process.

Our study presents several opportunities for future work. First, additional research into model architectures, data augmentation approaches, and training strategies would be useful for building effective downstream CAD models that can learn from latent representations. Second, the batch size and efficiency benefits afforded by latent representations raise the possibility of training large-scale foundation models on downsized latent representations. Whereas foundation models traditionally require significant computational resources and training time, using downsized latent representations that preserve diagnostic features can greatly accelerate model training, particularly in resource-constrained settings. Future work can explore foundation model performance and scaling laws in this context. Finally, future work can explore additional autoencoder training strategies to better preserve clinically relevant features at high downsizing factors.

Overall, our work demonstrates the potential that large-scale, generalizable autoencoders hold in addressing critical storage and efficiency challenges in the medical domain. 
