Cross-Task Knowledge Transfer for Semi-supervised Joint 3D Grounding and Captioning

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Oral · CC BY 4.0
Abstract: 3D visual grounding is a fundamental and important task in multimedia understanding, which aims to locate a specific object in a complex 3D scene according to a text description. However, this task requires a large number of labeled text-object pairs for training, and the scarcity of such annotations has been a key obstacle. To this end, this paper makes the first attempt to introduce and address a new semi-supervised setting, where only a few text-object labels are provided during training. Since most scene data is unannotated, we explore a new solution for unlabeled 3D grounding by additionally training on and transferring samples from a correlated task, i.e., 3D captioning. Our main insight is that 3D grounding and captioning are complementary and can be iteratively trained on unlabeled data to provide object and text contexts for each other through pseudo-label learning. Specifically, we propose a novel 3D Cross-Task Teacher-Student Framework (3D-CTTSF) for joint 3D grounding and captioning in the semi-supervised setting, where each branch contains parallel grounding and captioning modules. We first pre-train the two modules of the teacher branch on the limited labeled data as a warm-up. Then, we train the student branch to mimic the teacher model and iteratively update both branches with the unlabeled data. In particular, we transfer the learned knowledge between the grounding and captioning modules across the two branches to generate and refine pseudo-labels for unlabeled data, providing reliable supervision. To further improve pseudo-label quality, we design a cross-task pseudo-label generation scheme that filters low-quality pseudo-labels at the detection, captioning, and grounding levels. Experimental results on various datasets show competitive performance on both tasks compared to previous fully- and weakly-supervised methods, demonstrating that the proposed 3D-CTTSF can serve as an effective solution to the data scarcity issue.
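For a concrete picture of the training loop, below is a minimal, hypothetical Python sketch of the two ingredients the abstract names: cross-task pseudo-label filtering at the detection, captioning, and grounding levels, and a teacher-student parameter update. All class and field names, thresholds, and the EMA update rule are illustrative assumptions; the paper does not specify this interface, and the real modules would operate on point clouds and text rather than scalars.

```python
# Hypothetical sketch of the cross-task teacher-student loop; names,
# thresholds, and the EMA rule are assumptions, not the authors' code.
from dataclasses import dataclass
from typing import List

@dataclass
class PseudoLabel:
    box: tuple        # teacher grounding output: a predicted 3D bounding box
    caption: str      # teacher captioning output for that box
    det_score: float  # detection-level confidence
    cap_score: float  # captioning-level quality score
    gnd_score: float  # grounding-level agreement score

def filter_pseudo_labels(cands: List[PseudoLabel],
                         t_det: float = 0.5,
                         t_cap: float = 0.5,
                         t_gnd: float = 0.5) -> List[PseudoLabel]:
    """Cross-task filtering: keep a pseudo-label only if it passes the
    detection, captioning, and grounding checks (thresholds assumed)."""
    return [p for p in cands
            if p.det_score >= t_det and p.cap_score >= t_cap and p.gnd_score >= t_gnd]

def ema_update(teacher: List[float], student: List[float],
               momentum: float = 0.999) -> List[float]:
    """Exponential-moving-average teacher update, a common choice in
    teacher-student frameworks (momentum value assumed)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]

# Usage: filter the teacher's proposals on an unlabeled scene; the student
# would then train on the survivors before the teacher is refreshed.
cands = [PseudoLabel((0, 0, 0, 1, 1, 1), "a brown chair", 0.9, 0.8, 0.7),
         PseudoLabel((2, 2, 2, 1, 1, 1), "a table", 0.3, 0.9, 0.6)]
kept = filter_pseudo_labels(cands)   # the low-confidence proposal is dropped
print(len(kept))                     # -> 1
```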
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation
Relevance To Conference: This paper addresses 3D visual grounding, a multi-modal task involving point-cloud, image, and text modalities, which makes it central to multimedia understanding. The goal of this task is to locate a specific object within a complex 3D scene based on a textual description. In particular, we tackle the scarcity of annotated data by introducing a semi-supervised method, the 3D Cross-Task Teacher-Student Framework (3D-CTTSF).
Supplementary Material: zip
Submission Number: 1245