Keywords: panoptic segmentation, semantic segmentation, convolutional networks, mobile models
TL;DR: Universal panoptic segmentation model with pure convolutions tailored for mobile devices.
Abstract: Universal panoptic segmentation models have achieved state-of-the-art quality by using transformers to predict masks. In mobile applications, however, transformer models are not computation-friendly due to their quadratic complexity with respect to input length. In this work, we present MaskConver, a unified panoptic and semantic segmentation model built with pure convolutions and optimized for mobile devices. We propose a novel lightweight mask embedding decoder that predicts mask embeddings, which are then used to predict a set of binary masks for both things and stuff classes. MaskConver achieves a \textbf{37.2\%} panoptic quality score on the COCO validation set, \textbf{6.4\%} better than Panoptic DeepLab with the same MobileNet backbone. After mobile-specific optimizations, MaskConver runs in real time at \textbf{30} FPS on a Pixel 6 while delivering a 29.7\% panoptic quality score, 10$\times$ faster than Panoptic DeepLab with the same backbone.
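The abstract's key mechanism, decoding mask embeddings into binary masks, matches the common embedding-times-pixel-features formulation used by mask-transformer models. Below is a minimal sketch of that idea with hypothetical shapes (N region embeddings of dimension D against a D×H×W convolutional feature map); none of the names or sizes come from the paper.

```python
import torch

# Hypothetical sizes: N predicted regions, embedding dim D, output H x W.
N, D, H, W = 100, 256, 160, 160

mask_embeddings = torch.randn(N, D)    # one embedding per thing/stuff region
pixel_features = torch.randn(D, H, W)  # per-pixel features from a conv decoder

# Each mask logit map is the dot product of a mask embedding with every
# pixel feature -- equivalent to a 1x1 convolution, so it stays conv-friendly.
mask_logits = torch.einsum('nd,dhw->nhw', mask_embeddings, pixel_features)

# Sigmoid gives soft binary masks covering things and stuff uniformly.
binary_masks = mask_logits.sigmoid() > 0.5
```

Because this decoding step reduces to a matrix multiply (a 1×1 convolution over the feature map), it involves no attention, which is consistent with the abstract's pure-convolution, mobile-friendly claim.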
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)