UniVis: A Universal Framework for Computer Vision Tasks

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Universal framework; Diffusion models; Instruction tuning; In-context learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present a universal learning framework for a wide range of computer vision tasks from the perspective of generative modeling.
Abstract: We propose $\texttt{UniVis}$, a universal learning framework that tames a wide range of computer vision tasks, including visual understanding (e.g., semantic segmentation), low-level image processing (e.g., denoising), and conditional image generation (e.g., edge-to-image synthesis). Built on a large-scale pre-trained text-to-image diffusion model, $\texttt{UniVis}$ unifies various vision tasks through a general framework using instruction tuning, where its unifying ability comes from the generative and reasoning power of the pre-trained model. Specifically, $\texttt{UniVis}$ defines a general image completion task wherein the input consists of a pair of input-output images corresponding to the target task and a query image, and the aim is to generate the "missing" data paired to the query. The paired images serve as an image instruction defining the task; e.g., semantic segmentation is represented by an RGB image and its segmentation mask. Our rationale is that each computer vision task can be characterized by its unique input-output pair, which informs our $\texttt{UniVis}$ model of the expected output for the given query. Furthermore, a task-level or instance-level prompt can optionally be added to provide a text instruction. By unifying various visual tasks, $\texttt{UniVis}$ minimizes the inductive bias inherent in designing models for individual tasks, and it also suggests that different visual tasks can be understood through a shared generative model. In experiments, $\texttt{UniVis}$ shows strong performance on standard computer vision benchmarks spanning ten tasks in total. The source code will be made publicly available.
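The image completion task described above can be illustrated with a minimal sketch: arrange the instruction pair (task input and its output) together with the query on one canvas, leaving a masked region that a diffusion model would in-paint. The 2x2 grid layout, the function name, and the mask convention here are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def make_completion_grid(instr_in, instr_out, query, mask_value=0.0):
    """Compose the image instruction and the query into one canvas.

    Layout (assumed for illustration):
        top-left:     instruction input  (e.g., RGB image)
        top-right:    instruction output (e.g., segmentation mask)
        bottom-left:  query image
        bottom-right: masked region -- the "missing" data the
                      generative model is asked to complete
    """
    h, w, c = query.shape
    canvas = np.full((2 * h, 2 * w, c), mask_value, dtype=query.dtype)
    canvas[:h, :w] = instr_in    # task example: input
    canvas[:h, w:] = instr_out   # task example: expected output
    canvas[h:, :w] = query       # new query to be processed
    # bottom-right quadrant stays at mask_value for in-painting
    return canvas
```

A different task is specified simply by swapping in a different instruction pair (e.g., a noisy/clean pair for denoising), with no change to the model or the layout.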
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5574