WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition

Published: 20 Jul 2024, Last Modified: 21 Jul 2024
Venue: MM2024 Poster
License: CC BY 4.0
Abstract: Weakly-supervised visual recognition using inexact supervision is a critical yet challenging learning problem. It significantly reduces human labeling costs and traditionally relies on multi-instance learning and pseudo-labeling. This paper introduces WeakSAM, which tackles weakly-supervised object detection (WSOD) and weakly-supervised instance segmentation (WSIS) by utilizing the pre-learned world knowledge contained in a vision foundation model, i.e., the Segment Anything Model (SAM). WeakSAM addresses two critical limitations of traditional WSOD retraining, i.e., pseudo ground truth (PGT) incompleteness and noisy PGT instances, through adaptive PGT generation and Region of Interest (RoI) drop regularization. It also addresses SAM's shortcomings of requiring human prompts and being category-unaware in object detection and segmentation. Our results indicate that WeakSAM surpasses previous state-of-the-art methods on WSOD and WSIS benchmarks by large margins, i.e., average improvements of 7.4% and 8.5%, respectively.
Primary Subject Area: [Content] Media Interpretation
Relevance To Conference: The proposed method, WeakSAM, advances data-efficient multimedia processing, i.e., weakly-supervised instance-level recognition for visual input, which is essential to ACM MM's focus on innovative analysis methods. The model enhances the dense perception of multimedia visual inputs using minimal supervision, a significant benefit given the high cost and effort of extensive annotation. WeakSAM tackles weakly-supervised object detection (WSOD) and weakly-supervised instance segmentation (WSIS) by utilizing the pre-learned world knowledge contained in a vision foundation model, i.e., the Segment Anything Model (SAM). WeakSAM addresses two critical limitations of traditional WSOD retraining, i.e., pseudo ground truth (PGT) incompleteness and noisy PGT instances, through adaptive PGT generation and Region of Interest (RoI) drop regularization. It also addresses SAM's shortcomings of requiring human prompts and being category-unaware in object detection and segmentation. These improvements significantly boost performance on WSOD and WSIS benchmarks, demonstrating substantial advances in using sparsely labeled data for complex recognition tasks. This capability aligns with ACM MM's goals by showing how foundation models can be adapted for improved multimedia understanding with limited data.
Supplementary Material: zip
Submission Number: 1969