Abstract: Developing a single neural network that can perform a wide range of tasks is an active area of research in computer vision. However, unifying models for pixel-labeling tasks presents significant challenges due to the diverse forms of outputs and the reliance on task-specific structures. In this paper, we propose UniPixTask, a model that unifies pixel-labeling tasks by modeling them with discrete vocabulary codes. UniPixTask consists of two main components: a Sequence Learner that produces a unified expression and a Dense Decoder that constructs latent codes for dense prediction outputs. This combination enables UniPixTask to model the complex task space using unified discrete representations while maintaining high-quality output for pixel-labeling tasks. We evaluate UniPixTask on three pixel-labeling tasks: semantic segmentation, surface normal estimation, and monocular depth estimation. The experimental results show that UniPixTask is a promising approach for unifying pixel-labeling tasks and is competitive with established task-specific models.
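To make the two-component design described in the abstract more concrete, the sketch below illustrates one plausible reading of it: a Sequence Learner that maps image features to discrete vocabulary codes, and a Dense Decoder that turns those codes back into a dense per-pixel prediction. This is not the authors' implementation; all module names, dimensions (feat_dim, vocab_size, the 16×16 code grid), and layer choices are hypothetical placeholders chosen only to show the flow from discrete codes to a dense output.

```python
import torch
import torch.nn as nn

class SequenceLearner(nn.Module):
    """Hypothetical sketch: maps image features to a sequence of discrete
    vocabulary codes. The actual model may decode codes autoregressively;
    the argmax here is only a stand-in for illustration."""
    def __init__(self, feat_dim=256, vocab_size=8192):
        super().__init__()
        self.proj = nn.Linear(feat_dim, vocab_size)

    def forward(self, feats):                  # feats: (B, seq_len, feat_dim)
        logits = self.proj(feats)              # (B, seq_len, vocab_size)
        return logits.argmax(dim=-1)           # discrete code indices, (B, seq_len)

class DenseDecoder(nn.Module):
    """Hypothetical sketch: embeds discrete codes into latent vectors and
    upsamples them into a dense prediction map (e.g. depth or normals)."""
    def __init__(self, vocab_size=8192, embed_dim=256, out_channels=1):
        super().__init__()
        self.codebook = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, embed_dim // 2, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(embed_dim // 2, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, codes, grid_hw):         # codes: (B, H*W) token indices
        b = codes.shape[0]
        h, w = grid_hw
        latents = self.codebook(codes)         # (B, H*W, embed_dim)
        latents = latents.transpose(1, 2).reshape(b, -1, h, w)
        return self.head(latents)              # dense prediction map

# Usage: a 16x16 grid of codes decoded to a 64x64 single-channel map.
feats = torch.randn(2, 256, 256)
codes = SequenceLearner()(feats)
pred = DenseDecoder()(codes, grid_hw=(16, 16))
print(pred.shape)  # torch.Size([2, 1, 64, 64])
```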
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Detailed comparisons with prior related work (Unified-IO) are provided.
Comparisons with recent task-specific approaches and the computational costs of different components have been added.
Other details have been revised in response to reviewer comments.
Assigned Action Editor: ~Simon_Kornblith1
Submission Number: 786