MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding

Wenwen Zeng; Yonghuang Wu; Yifan Chen; Chengqian Zhao; Feiyu Yin; Xuan Xie; Guoqing Wu; Jinhua Yu

19 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: fMRI-to-Video Reconstruction, Multi-Shot Video Reconstruction, fMRI-to-Text Decoding

TL;DR: MindShot pioneers multi-shot fMRI video reconstruction by explicitly decoupling mixed signals into shot-specific segments and decoding semantic keyframe captions via LLMs, enabling accurate recovery of complex visual narratives.

Abstract: Reconstructing dynamic videos from fMRI is important for understanding visual cognition and enabling vivid brain-computer interfaces. However, current methods are limited to single-shot clips with video-level alignment and reconstruction, and fail to address the multi-shot nature of real-world experiences. To bridge this gap, we propose MindShot, a novel shot-level framework that reconstructs multi-shot videos from fMRI via a divide-and-decode strategy. The framework consists of three stages: (1) Shot Decomposition: we first identify shot boundaries within the fMRI signal, then decompose the mixed signals into distinct, shot-specific segments. This explicit segmentation serves as the foundation for accurate semantic decoding. (2) Keyframe Decoding: each segment is decoded into a textual description representing the keyframe of its corresponding shot. (3) Video Reconstruction: the final video is generated from these keyframe captions, mitigating noise from fMRI redundancy. To address the lack of real data for multi-shot reconstruction, we introduce a large-scale synthetic dataset generated via a data augmentation strategy that randomizes scene duration ratios. Experimental results show that our framework outperforms state-of-the-art methods in both single-shot and multi-shot reconstruction fidelity. Ablation studies confirm the necessity and generalizability of our Shot Boundary Predictor (SBP): explicit shot-level decomposition improves the CLIP similarity of decoded captions by 71.8%, and the SBP yields consistent performance gains when integrated into other state-of-the-art architectures. Moreover, training on our synthetic data lets the model generalize to diverse data and gives it strong zero-shot transferability, bridging the domain gap between synthetic and real fMRI signals. This work establishes a new paradigm for multi-shot fMRI reconstruction, enabling accurate recovery of complex visual narratives through explicit decomposition and semantic prompting.
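
Illustrative sketch (not the authors' implementation): the abstract describes synthetic multi-shot data built by randomizing scene duration ratios, and a decomposition step that splits the mixed signal at shot boundaries. The Python below shows one minimal way to picture both steps, assuming an fMRI recording is a plain list of per-time-point voxel vectors and that shots are simply concatenated. All names (`random_duration_split`, `synthesize_multi_shot`, `decompose`) are hypothetical, the true boundaries stand in for the output of a learned boundary predictor, and hemodynamic overlap between neighbouring shots is ignored.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical representation: one fMRI time point is a vector of voxel values.
Frame = List[float]


def random_duration_split(total_frames: int, num_shots: int,
                          rng: random.Random) -> List[int]:
    """Split total_frames into num_shots positive lengths with random ratios."""
    if not 1 <= num_shots <= total_frames:
        raise ValueError("need 1 <= num_shots <= total_frames")
    # Draw distinct interior cut points so that every shot gets >= 1 frame.
    cuts = sorted(rng.sample(range(1, total_frames), num_shots - 1))
    edges = [0] + cuts + [total_frames]
    return [end - start for start, end in zip(edges, edges[1:])]


def synthesize_multi_shot(clips: Sequence[Sequence[Frame]], total_frames: int,
                          rng: random.Random) -> Tuple[List[Frame], List[int]]:
    """Concatenate truncated single-shot recordings into one mixed recording.

    Returns the mixed recording and the interior shot-boundary indices.
    Assumption: plain concatenation, with no modelling of signal overlap.
    """
    lengths = random_duration_split(total_frames, len(clips), rng)
    mixed: List[Frame] = []
    boundaries: List[int] = []
    for clip, length in zip(clips, lengths):
        if len(clip) < length:
            raise ValueError("clip is shorter than its sampled duration")
        mixed.extend(clip[:length])
        boundaries.append(len(mixed))
    return mixed, boundaries[:-1]  # drop the trailing end-of-recording index


def decompose(mixed: Sequence[Frame],
              boundaries: Sequence[int]) -> List[List[Frame]]:
    """Split a mixed recording into shot-specific segments at the boundaries."""
    edges = [0, *boundaries, len(mixed)]
    return [list(mixed[start:end]) for start, end in zip(edges, edges[1:])]
```

In this sketch, each segment returned by `decompose` would then be passed to a caption decoder (stage 2), and the resulting keyframe captions to a video generator (stage 3); neither of those stages is modelled here.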

Primary Area: applications to neuroscience & cognitive science

Submission Number: 18015
