Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images

Published: 20 Jul 2024 · Last Modified: 30 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Continual learning (CL) breaks away from the one-off training paradigm and enables a model to adapt continuously to new data, semantics and tasks. However, current CL methods focus mainly on single tasks. Moreover, without access to old data, CL models suffer from catastrophic forgetting and semantic drift, which is especially pronounced in remote-sensing interpretation because of its intricate fine-grained semantics. In this paper, we propose Continual Panoptic Perception (CPP), a unified continual learning model that leverages multi-task joint learning, covering pixel-level classification, instance-level segmentation and image-level perception, for universal interpretation of remote sensing images. Concretely, we propose a collaborative cross-modal encoder (CCE) to extract the input image features, which supports pixel classification and caption generation synchronously. To inherit knowledge from the old model without exemplar memory, we propose a task-interactive knowledge distillation (TKD) method, which leverages cross-modal optimization and task-asymmetric pseudo-labeling (TPL) to alleviate catastrophic forgetting. Furthermore, we propose a joint optimization mechanism to achieve end-to-end multi-modal panoptic perception. Experimental results on the fine-grained panoptic perception dataset validate the effectiveness of the proposed model and show that joint optimization boosts sub-task CL efficiency, with over 13% relative improvement in PQ.
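The exemplar-free setting described above is typically realized by distilling the old model's soft outputs into the new model and by converting confident old-model predictions into pseudo-labels for classes no longer annotated. The sketch below illustrates these two generic ingredients in NumPy; it is a minimal illustration under assumed interfaces (function names, a confidence threshold, and a temperature are hypothetical), not the paper's actual TKD/TPL implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Mean KL(teacher || student) over temperature-softened distributions,
    # the standard knowledge-distillation objective (T is an assumption).
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl.mean())

def pseudo_labels(old_model_logits, threshold=0.9):
    # Keep the old model's prediction only where its confidence exceeds
    # the threshold; uncertain positions get the ignore index -1.
    probs = softmax(old_model_logits)
    conf = probs.max(axis=-1)
    labels = probs.argmax(axis=-1)
    labels[conf < threshold] = -1
    return labels
```

For example, a confident prediction survives as a pseudo-label while a near-uniform one is ignored, and distilling a model against its own outputs yields zero loss.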
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work explores continual learning on remote sensing images via multi-task and multi-modal learning and proposes Continual Panoptic Perception. It breaks through the bottleneck of current continual learning approaches, which can only handle single-task interpretation. Given the continuously arriving data in remote-sensing observation systems, interpretation models are often required to have multi-task, multi-modal incremental interpretation capabilities. The proposed CPP model expands the scope of image interpretation from a single modality to panoptic perception covering pixel classification, instance segmentation and caption generation, enabling an effective exploration of multi-task and multi-modal interpretation of remote sensing images.
Submission Number: 139