Learn By An Example Transformer For Domain Generalization In Video Object Segmentation

Published: 01 Jan 2024, Last Modified: 07 Nov 2025 · ICIP 2024 · CC BY-SA 4.0
Abstract: Video object segmentation is a challenging computer-vision task in which a learning model must segment and track a specific set of objects across a frame sequence. The target objects are specified by the ground-truth annotation of the first frame in the sequence. To generalize across domains in this task, a model would need to be trained on a massive labeled dataset covering nearly every kind of object that might appear in a frame sequence. No such dataset exists, because labeling for this task is prohibitively expensive: it requires per-pixel annotation of every frame in each sequence. In this paper, we propose a novel learning technique and transformer architecture. The learning technique allows the model to learn effectively from a small labeled dataset, while the architecture allows the model to produce segmentation output as a function of an input example, rather than relying on memorized representations of all objects to be segmented. Experiments show that the proposed model outperforms state-of-the-art models by $10.6\%$ when evaluated on out-of-domain frame sequences.
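The abstract does not give architectural details, but "segmentation as a function of an input example" suggests cross-attention from current-frame tokens to tokens of the annotated first frame. The following is a minimal NumPy sketch of that idea under our own assumptions; all names (`cross_attend`, the mask-embedding values) are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(frame_feats, example_feats, example_mask_emb):
    """Condition current-frame features on the annotated example frame.

    frame_feats:      (N_q, d) query tokens from the current frame
    example_feats:    (N_k, d) key tokens from the first (annotated) frame
    example_mask_emb: (N_k, d) value tokens carrying the example's mask labels
    Returns (N_q, d) mask-conditioned features for the current frame.
    """
    d = frame_feats.shape[-1]
    # Scaled dot-product attention: each query token attends to example tokens.
    attn = softmax(frame_feats @ example_feats.T / np.sqrt(d), axis=-1)
    return attn @ example_mask_emb

# Toy usage with random features (shapes only, no trained weights).
rng = np.random.default_rng(0)
q = rng.standard_normal((16, 32))   # 16 current-frame tokens, dim 32
k = rng.standard_normal((64, 32))   # 64 example-frame tokens
v = rng.standard_normal((64, 32))   # mask embeddings for those tokens
out = cross_attend(q, k, v)
print(out.shape)  # (16, 32)
```

Because the segmentation head reads these example-conditioned features, the same weights can in principle segment unseen object categories at test time, which matches the paper's stated goal of domain generalization.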