TokenUnify: Scalable Autoregressive Pretraining for Large Scale EM Image Segmentation

Yinda Chen; Haoyuan Shi; Xiaoyu Liu; Te Shi; Ruobing Zhang; Dong Liu; Zhiwei Xiong; Feng Wu

27 Sept 2024 (modified: 24 Oct 2024) | ICLR 2025 Conference Desk Rejected Submission | CC BY 4.0

Keywords: biological image, autoregressive visual pre-training

Abstract: Autoregressive next-token prediction, a standard pretraining method for large-scale language models, excels in handling long sequential data. However, its application to complex visual tasks, particularly biological imaging, faces challenges due to the spatial continuity and high dimensionality of biological images. High-resolution 3D biological images, such as electron microscopy (EM) brain scans, offer ideal long-sequence data, but existing methods struggle to fully leverage this characteristic. To address these challenges, we introduce TokenUnify, a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction. We provide theoretical evidence demonstrating that TokenUnify mitigates cumulative errors in visual autoregression, particularly when dealing with complex three-dimensional anatomical structures. In conjunction with TokenUnify, we have assembled a large-scale, ultra-high-resolution EM brain image dataset comprising over 120 million finely annotated voxels. This dataset not only represents the largest neuron segmentation dataset to date but, more importantly, provides ideal long-sequence biological image data that fully exhibits spatial continuity. Leveraging the Mamba network, which is inherently suited for long-sequence modeling, TokenUnify capitalizes on the advantages of autoregressive methods in processing long-sequence data, achieving a 45% performance improvement on downstream EM neuron segmentation tasks compared to existing methods. Furthermore, TokenUnify demonstrates superior scalability over MAE and traditional autoregressive methods, effectively bridging the gap between pretraining strategies for language and vision models. Code is available at https://anonymous.4open.science/r/TokenUnify-3DBF.
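
The abstract names three prediction objectives without defining them on this page. As a rough illustration only, the toy sketch below shows what the training targets for each objective could look like on a short token sequence, assuming the usual meanings of the terms: next-token predicts the single following token, next-all predicts every remaining token, and random token prediction hides a random subset and predicts it. The function names, the `mask_ratio` and `seed` parameters, and the `<mask>` placeholder are hypothetical and are not taken from the paper or its released code.

```python
import random


def next_token_targets(tokens):
    """For each prefix, the target is the single token that follows it."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


def next_all_token_targets(tokens):
    """For each prefix, the targets are all tokens that follow it."""
    return [(tokens[: i + 1], tokens[i + 1:]) for i in range(len(tokens) - 1)]


def random_token_targets(tokens, mask_ratio=0.5, seed=0, mask="<mask>"):
    """Hide a random subset of positions; the targets are the hidden tokens.

    mask_ratio, seed, and mask are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    visible = [mask if i in hidden else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return visible, targets


# A toy sequence standing in for flattened image patches (hypothetical names).
tokens = ["p0", "p1", "p2", "p3"]

for context, target in next_token_targets(tokens):
    print("next-token:", context, "->", target)

for context, targets in next_all_token_targets(tokens):
    print("next-all:  ", context, "->", targets)

visible, hidden_targets = random_token_targets(tokens)
print("random:    ", visible, "->", hidden_targets)
```

How the three objectives are weighted or scheduled during pretraining is not stated in the abstract, so the sketch leaves that out.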

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 9060
