Keywords: Perceiver, diffusion model, autoencoder, self-supervised learning
TL;DR: We propose a diffusion autoencoder with a Perceiver architecture to handle long, irregular, and multimodal sequences.
Abstract: Self-supervised learning has become a central strategy for representation learning, but the majority of successful architectures assume regularly sampled inputs such as images, audio, and video. In many scientific domains, such as astrophysics, data arrive as long, irregular, and multimodal sequences that existing methods do not handle natively. We introduce the Diffusion Autoencoder with Perceivers (daep), a diffusion autoencoder architecture designed for such settings. Our method tokenizes heterogeneous measurements, compresses them with a Perceiver encoder, and reconstructs them with a Perceiver-IO diffusion decoder, enabling scalable learning without assuming uniform sampling. For a fair comparison, we also adapt masked autoencoders (MAE) to Perceivers, establishing a strong baseline in the same architectural family. Across spectral, photometric, and multimodal astronomical datasets, daep achieves lower reconstruction error and produces smoother, more discriminative latent spaces than VAE and Perceiver-MAE baselines, particularly when preserving high-frequency structure is critical. Our results suggest that daep provides a general framework for learning robust representations from irregular multimodal data, with potential applications well beyond astronomy.
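The scalability claim rests on the Perceiver's latent cross-attention: a fixed-size latent array queries an arbitrarily long, irregularly sampled token set, so cost grows with L*N rather than N^2. A minimal NumPy sketch of that mechanism (all shapes and weight names here are illustrative, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, tokens, Wq, Wk, Wv):
    """Fixed-size latent array queries a variable-length token set.

    latents: (L, d) learned latent array; tokens: (N, d) tokenized
    measurements. Returns (L, d): cost is O(L*N), independent of N^2.
    """
    q = latents @ Wq          # queries come from the latents
    k = tokens @ Wk           # keys/values come from the (irregular) inputs
    v = tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (L, N) attention map
    return attn @ v

rng = np.random.default_rng(0)
d = 16
tokens = rng.normal(size=(100, d))   # e.g., 100 irregularly timed measurements
latents = rng.normal(size=(8, d))    # fixed latent bottleneck, L = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
z = cross_attend(latents, tokens, Wq, Wk, Wv)
print(z.shape)  # (8, 16): compressed summary, regardless of input length
```

Because the output shape depends only on the latent size, the same encoder accepts sequences of any length or sampling pattern; a Perceiver-IO-style decoder then reverses the trick, using output query positions to attend back into the latents.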
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 14631