Unifying Pixel-Labeling Vision Tasks by Sequence Modeling

TMLR Paper786 Authors

20 Jan 2023 (modified: 08 Apr 2023) · Rejected by TMLR
Abstract: Developing a single neural network that can perform a wide range of tasks is an active area of research in computer vision. However, unifying models for pixel-labeling tasks presents significant challenges due to the diverse forms of outputs and the reliance on task-specific structures. In this paper, we propose UniPixTask, a model that unifies pixel-labeling tasks by modeling them with discrete vocabulary codes. UniPixTask consists of two main components: a Sequence Learner that produces a unified expression and a Dense Decoder that constructs latent codes for dense prediction outputs. This combination enables UniPixTask to model the complex task space using unified discrete representations while maintaining high-quality output for pixel-labeling tasks. We evaluate UniPixTask on three pixel-labeling tasks: semantic segmentation, surface normal estimation, and monocular depth estimation. The experimental results show that UniPixTask is a promising approach for unifying pixel-labeling tasks and is competitive with established task-specific models.
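Since the paper body is not included here, the following is a minimal sketch of the two components named in the abstract, assuming a transformer-based Sequence Learner that emits discrete vocabulary codes and a convolutional Dense Decoder that maps those codes to a dense per-pixel output. All module names, shapes, and hyperparameters below are hypothetical illustrations, not the authors' implementation; a real model would also need a differentiable training path for the discrete codes (e.g. teacher forcing or a straight-through estimator).

```python
# Hypothetical sketch (not the authors' code): a Sequence Learner that produces
# discrete code indices and a Dense Decoder that turns the corresponding code
# embeddings into a dense prediction map. All sizes and layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequenceLearner(nn.Module):
    """Encodes backbone features into a sequence of discrete vocabulary codes."""
    def __init__(self, feat_dim=256, vocab_size=1024, num_layers=4, num_codes=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.queries = nn.Parameter(torch.randn(num_codes, feat_dim))  # learned code slots
        self.to_vocab = nn.Linear(feat_dim, vocab_size)

    def forward(self, feats):                        # feats: (B, N, feat_dim)
        b = feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        x = self.encoder(torch.cat([q, feats], dim=1))[:, : self.queries.size(0)]
        logits = self.to_vocab(x)                    # (B, num_codes, vocab_size)
        return logits.argmax(-1)                     # discrete code indices


class DenseDecoder(nn.Module):
    """Maps discrete codes back to a dense per-pixel output (e.g. class logits)."""
    def __init__(self, vocab_size=1024, feat_dim=256, out_channels=19, out_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.out_size = out_size
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, out_channels, 1),
        )

    def forward(self, codes):                        # codes: (B, num_codes)
        x = self.embed(codes)                        # (B, num_codes, feat_dim)
        b, n, c = x.shape
        s = int(n ** 0.5)                            # assume codes lie on a square grid
        x = x.transpose(1, 2).reshape(b, c, s, s)
        x = F.interpolate(x, size=self.out_size, mode="bilinear", align_corners=False)
        return self.head(x)                          # (B, out_channels, H, W)


if __name__ == "__main__":
    feats = torch.randn(2, 196, 256)                 # stand-in for backbone features
    codes = SequenceLearner()(feats)
    print(DenseDecoder()(codes).shape)               # torch.Size([2, 19, 128, 128])
```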
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Detailed comparisons with prior related work (Unified-IO) have been added, along with comparisons to recent task-specific approaches and the computational costs of the different components. Other details have been revised in response to the reviewers' comments.
Assigned Action Editor: ~Simon_Kornblith1
Submission Number: 786