Abstract: Reconstructing visual stimuli from brain activity is crucial for deciphering the underlying mechanisms of the human visual system. While recent studies have achieved notable results by leveraging deep generative models, challenges persist due to the lack of large-scale datasets and the noise inherent in non-invasive measurement methods. In this study, we draw inspiration from the mechanism of human memory and propose BrainRAM, a novel two-stage dual-guided framework for visual stimuli reconstruction. BrainRAM incorporates a Retrieval-Augmented Module (RAM) and diffusion priors to enhance the quality of images reconstructed from brain activity. Specifically, in stage I, we transform fMRI voxels into the latent space of image and text embeddings via diffusion priors, obtaining preliminary estimates of the visual stimuli's semantics and structure. In stage II, based on these estimates, we retrieve data from the LAION-2B-en dataset and employ the proposed RAM to refine them, yielding high-quality reconstruction results. Extensive experiments demonstrate that BrainRAM outperforms current state-of-the-art methods both qualitatively and quantitatively, providing a new perspective for visual stimuli reconstruction.
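The retrieval step in stage II can be illustrated with a minimal sketch. The abstract does not state which similarity metric or index is used, so the code below assumes cosine similarity over embedding vectors and a brute-force scan of a tiny in-memory list. The names `cosine_similarity`, `retrieve_top_k`, `database`, and `stage1_estimate` are hypothetical and chosen only for illustration; retrieval over a corpus the size of LAION-2B-en would in practice need an approximate nearest-neighbor index rather than a full scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    database: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k database entries closest to the query."""
    scored = [(idx, cosine_similarity(query, emb)) for idx, emb in enumerate(database)]
    # Highest similarity first; ties keep their original database order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-in for a retrieval corpus: four 3-dimensional embeddings.
# (Assumption: real image/text embeddings would be much higher-dimensional.)
database = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical stage I output: a preliminary embedding estimated from fMRI voxels.
stage1_estimate = [1.0, 0.02, 0.0]

neighbours = retrieve_top_k(stage1_estimate, database, k=2)
for idx, score in neighbours:
    print(f"entry {idx}: similarity {score:.4f}")
```

In this toy setting, the retrieved neighbours stand in for the images and captions that the refinement module would consume; how BrainRAM actually fuses them is not described at this level of detail in the abstract.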
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Multimodal Fusion, [Engagement] Multimedia Search and Recommendation, [Experience] Multimedia Applications
Relevance To Conference: This work contributes to multimedia and multimodal processing by introducing BrainRAM, a novel two-stage dual-guided model that leverages retrieval-augmented generation and diffusion priors to reconstruct visual stimuli from brain activity. The retrieval-augmented approach, inspired by the human memory mechanism, addresses a critical challenge in multimedia processing: the integration of complex, multimodal data (image, text, and brain activity). By transforming fMRI voxels into a latent space of image and text embeddings, the model obtains a preliminary estimate of the semantics and structure of the visual stimuli. This estimate is refined in the second stage by the Retrieval-Augmented Module (RAM), which incorporates images and captions retrieved from the large-scale LAION-2B-en dataset. This dual-guided approach not only enhances the quality of the reconstructed images but also bridges the gap between raw brain signals and multimodal content generation. The methodology points to a promising direction for multimedia processing, particularly for generating more accurate and semantically rich content from diverse data sources, and for understanding how multimodal information is synthesized in the context of human cognitive processes.
Supplementary Material: zip
Submission Number: 3287