TL;DR: Masked autoencoders produce the worst reconstructions but learn the best atmospheric representations. Reconstruction fidelity is a poor proxy for representation utility in spectroscopic retrieval tasks.
Abstract: Deep learning enables efficient compression of hyperspectral satellite observations, but the choice of self-supervised objective significantly impacts what information is preserved. We compare three paradigms on data from NASA's Tropospheric Emissions: Monitoring of Pollution (TEMPO) instrument: variational autoencoders (VAE), autoregressive generation (GIVT), and masked autoencoders (MAE). Our experiments reveal a trade-off between reconstruction fidelity and atmospheric information preservation. VAE and GIVT achieve near-perfect reconstruction (MSE $\sim$ 0.0002) but encode atmospheric products less effectively. MAE produces substantially worse reconstructions (MSE $\sim$ 0.02–0.33) yet consistently outperforms both when retrieving NO$_2$, O$_3$, HCHO, and cloud fraction, with improvements of up to 69\% for NO$_2$ retrieval over VAE at 64$\times$ compression (R$^2$ = 0.49 vs. 0.29). Aggregated across products, MAE improves over VAE by 17\% (MLP probes) to 32\% (linear probes). This trade-off diminishes at aggressive compression but grows as representational capacity increases. Our experiments provide empirical evidence that reconstruction quality is a poor proxy for representation utility in spectroscopic retrieval tasks, with direct implications for pretraining objective selection in climate foundation model design.
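The linear-probe comparison and the relative-improvement figure quoted in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the embeddings and targets are synthetic stand-ins, and the helper names (`fit_linear_probe`, `predict`, `r2_score`, `relative_improvement`) are hypothetical. It assumes a probe is an ordinary least-squares map from frozen embeddings to a retrieved product, scored by R$^2$ on held-out samples.

```python
import numpy as np

def fit_linear_probe(z_train, y_train):
    """Fit a least-squares linear map (with bias) from embeddings to a target."""
    X = np.hstack([z_train, np.ones((z_train.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    return w

def predict(w, z):
    """Apply a fitted probe to new embeddings."""
    X = np.hstack([z, np.ones((z.shape[0], 1))])
    return X @ w

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def relative_improvement(r2_new, r2_base):
    """Relative gain of one R^2 over a baseline R^2."""
    return (r2_new - r2_base) / r2_base

# Synthetic stand-in for frozen encoder embeddings and one atmospheric product.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 16))               # hypothetical 16-dim embeddings
y = z @ rng.normal(size=16) + 0.1 * rng.normal(size=500)

w = fit_linear_probe(z[:400], y[:400])       # train on the first 400 samples
probe_r2 = r2_score(y[400:], predict(w, z[400:]))  # score on the held-out 100

# The NO2 figures quoted in the abstract: R^2 = 0.49 (MAE) vs. 0.29 (VAE).
no2_gain = relative_improvement(0.49, 0.29)
print(f"held-out probe R^2: {probe_r2:.3f}")
print(f"NO2 relative improvement: {no2_gain:.0%}")
```

Under this reading, the 69\% figure is the relative gain (0.49 - 0.29) / 0.29, not an absolute difference in R$^2$.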
Submission Number: 44