Abstract: Multi-task dense prediction plays an important role in computer vision and has a wide range of applications. Its main purpose is to reduce the number of trainable parameters by sharing them across tasks while exploiting the correlation between tasks to improve overall performance. We propose a task-conditional network that handles one task at a time and shares most network parameters to achieve these goals. Inspired by adapter tuning, we propose an adapter module that attends to both spatial- and channel-wise information to extract features from the frozen encoder backbone. Attaching a parallel adapter pathway to the encoder not only reduces the number of trainable parameters but also saves training time and memory. We additionally use learnable task prompts to model different tasks, and use these prompts to adjust some of the adapter parameters so that the network adapts to diverse tasks. These task-conditional adapters are also applied to the decoder, which enables the entire network to switch between tasks and produce better task-specific features. Extensive experiments on two challenging multi-task benchmarks, NYUD-v2 and PASCAL-Context, show that our approach achieves state-of-the-art performance with excellent parameter, time, and memory efficiency. The code is available at https://github.com/jfzleo/Task-Conditional-Adapter
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Multi-task dense prediction aims to perform multiple dense prediction tasks, such as semantic segmentation, edge detection, and depth estimation, using a single network. Its applications include autonomous driving, robotics, and virtual reality. The network we propose shares the vast majority of parameters across tasks and modulates the entire network for each task. This approach captures better task-specific features, thereby enabling better understanding, processing, and reasoning across various tasks in the visual modality. By combining visual understanding with knowledge reasoning, the network can provide a richer, more nuanced interpretation of multimedia content.
Submission Number: 4866