Abstract: Miscalibrated models tend to be unreliable and insecure for downstream applications. In this work, we highlight and remedy miscalibration in current scene graph generation (SGG) models, an issue overlooked by previous works. We discover that obtaining well-calibrated models for SGG is more challenging than in conventional calibration settings, as long-tailed SGG training data exacerbates miscalibration through overconfidence in head classes and underconfidence in tail classes. We further analyze which components are directly impacted by the long-tailed data during optimization and thereby exacerbate miscalibration and unbalanced learning: \textbf{biased parameters}, \textbf{deviated boundaries}, and \textbf{distorted target distribution}. To address these issues, we propose the \textbf{C}ompositional \textbf{O}ptimization \textbf{C}alibration (\textbf{COC}) method, comprising three modules: i. A parameter calibration module that utilizes a hyperspherical classifier to eliminate the bias introduced by biased parameters. ii. A boundary calibration module that disperses features of majority classes to consolidate the decision boundaries of minority classes and mitigate deviated boundaries. iii. A target distribution calibration module that addresses distorted target distribution, leverages a within-triplet prior to guide confidence-aware and label-aware target calibration, and applies curriculum regulation to guide the learning focus from easy to hard classes. Extensive evaluation on popular benchmarks demonstrates the effectiveness of our proposed method in improving model calibration and resolving unbalanced learning for long-tailed SGG. Finally, our proposed method outperforms various types of calibration methods on model calibration and achieves state-of-the-art trade-off performance on balanced learning for SGG. The source code and models will be released upon acceptance.
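For readers unfamiliar with the calibration metric underlying this line of work, a common way to quantify miscalibration is the Expected Calibration Error (ECE): predictions are binned by confidence, and the gap between mean confidence and accuracy is averaged across bins. The sketch below is a minimal, standard illustration of that metric, not the paper's COC method; the function name and binning scheme are our own choices.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average over bins of the absolute gap
    between mean predicted confidence and empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # assign each prediction to a half-open bin (lo, hi]
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        weight = in_bin.mean()  # fraction of samples in this bin
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += weight * gap
    return ece
```

A model that predicts 90% confidence and is right 90% of the time scores an ECE of 0 (well calibrated), whereas an overconfident model — the failure mode the abstract attributes to head classes — accumulates a large gap.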
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Scene graph generation (SGG) is a critical task in multimedia and multimodal processing, providing structured representations of visual scenes that capture the relationships between objects. Such scene graphs can 1. incorporate information from multiple modalities, such as images, text, and audio, to facilitate more effective analysis and interpretation of multimedia content in multimodal downstream tasks, e.g., video question answering and video captioning; 2. serve as input to generative models to produce new multimedia content that adheres to specific semantic constraints; and 3. yield more relevant and accurate results for retrieval systems by indexing multimedia data based on the semantic information contained in scene graphs.
Given the importance of SGG in multimedia and multimodal processing, it is crucial to build well-calibrated SGG models that produce reliable and unbiased scene graphs for downstream applications. We are the first to explore calibration in the context of scene graph generation: we establish a new benchmark and propose a compositional framework for SGG calibration. Extensive empirical evaluations demonstrate the effectiveness of our model in alleviating miscalibration and unbalanced learning for long-tailed SGG.
Supplementary Material: zip
Submission Number: 487