Sparse Tokens for Dense Prediction - The Medical Image Segmentation Case

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: token pruning, vision transformer, dense prediction, medical image segmentation
Abstract: Can we use sparse tokens for dense prediction, e.g., segmentation? Although token sparsification has been applied to Vision Transformers (ViT) to accelerate classification, it is still unknown how to perform segmentation from sparse tokens. To this end, we reformulate segmentation as a sparse encoding -> token completion -> dense decoding (SCD) pipeline. We first show empirically that naively applying existing approaches from classification token pruning and masked image modeling (MIM) leads to failure and training inefficiency, caused by inappropriate sampling algorithms and the low quality of the restored dense features. In this paper, we propose Soft-topK Token Pruning (STP) and Multi-layer Token Assembly (MTA) to address these problems. Specifically, in the sparse encoding stage, STP predicts token-wise importance scores with a lightweight sub-network and samples the topK-scored tokens. The intractable gradients of topK are approximated through a continuous perturbed score distribution. In the token completion stage, MTA restores a full token sequence by assembling both sparse output tokens and pruned intermediate tokens from multiple layers. Compared to MIM, which fills the pruned positions with mask tokens, MTA produces more informative representations, allowing more accurate segmentation. The final dense decoding stage is compatible with the decoders of existing segmentation frameworks, e.g., UNETR. Experiments show that SCD pipelines equipped with our STP and MTA are much faster than baselines without token sparsification in both training (up to 120% higher throughput) and inference (up to 60.6% higher throughput) while maintaining segmentation quality.
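To make the SCD idea concrete, below is a minimal PyTorch sketch of the two ingredients the abstract describes: a lightweight scorer that keeps the topK-scored tokens under a perturbed score distribution, and a completion step that reassembles a full-length token sequence by scattering the kept tokens over intermediate-layer features instead of mask tokens. This is an illustrative sketch, not the authors' implementation; the scorer architecture, Gaussian noise model, sigmoid re-weighting, and the single-layer completion rule are all assumptions.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# Module and variable names are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn


class SoftTopKPruning(nn.Module):
    """Score tokens with a small MLP and keep the top-K under perturbed scores."""

    def __init__(self, dim: int, keep_ratio: float = 0.5, noise_scale: float = 0.1):
        super().__init__()
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))
        self.keep_ratio = keep_ratio
        self.noise_scale = noise_scale

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, D)
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.scorer(tokens).squeeze(-1)                     # (B, N)
        if self.training:
            # Perturb the scores so the top-K choice varies smoothly with the
            # logits; this stands in for the continuous perturbed score
            # distribution used to approximate the intractable topK gradients.
            scores = scores + self.noise_scale * torch.randn_like(scores)
        keep_idx = scores.topk(k, dim=1).indices                      # (B, k)
        kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        # Re-weight kept tokens by their (sigmoid) scores so gradients reach the scorer.
        kept = kept * torch.sigmoid(torch.gather(scores, 1, keep_idx)).unsqueeze(-1)
        return kept, keep_idx


def assemble_full_sequence(kept: torch.Tensor, keep_idx: torch.Tensor,
                           intermediate: torch.Tensor) -> torch.Tensor:
    """Token completion: start from intermediate-layer features for all N positions
    and overwrite the kept positions with the sparse encoder output."""
    full = intermediate.clone()                                       # (B, N, D)
    full.scatter_(1, keep_idx.unsqueeze(-1).expand_as(kept), kept)
    return full


if __name__ == "__main__":
    B, N, D = 2, 196, 64
    x = torch.randn(B, N, D)                  # tokens after patch embedding
    pruner = SoftTopKPruning(D, keep_ratio=0.5)
    kept, idx = pruner(x)                      # sparse tokens fed to the ViT encoder
    # For illustration, reuse `x` as the pruned intermediate features kept for completion.
    dense = assemble_full_sequence(kept, idx, intermediate=x)
    print(kept.shape, dense.shape)             # (2, 98, 64) (2, 196, 64)
```

The restored dense sequence can then be handed to a standard segmentation decoder (e.g., the UNETR decoder mentioned in the abstract); the actual method assembles pruned intermediate tokens from multiple layers rather than the single layer used here.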
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
TL;DR: We show how to perform dense prediction efficiently with a sparse token ViT while maintaining performance.
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
