Keywords: Scene Graph Generation, Language Models, Zero-Shot Learning, Prompt Learning, Visual Relationship Detection
TL;DR: LM-ANet boosts long-tail and zero-shot predicate prediction in SGG by projecting visual features into a frozen BERT space and casting predicate classification as subject-object-aware masked language modeling with prototype matching.
Abstract: Scene graph generation suffers from severe bias in long-tail and zero-shot predicate prediction. We propose the Language Model-Aided Network (LM-ANet), which repurposes a completely frozen 110M-parameter BERT-base as a zero-shot predicate predictor without any fine-tuning or task-specific classifier. It achieves this by projecting visual tokens into BERT's embedding space and casting predicate classification as cloze-style masked language modeling with subject-object-aware prompts, followed by prototype-based matching.
On Visual Genome, LM-ANet achieves 23.8% zero-shot Recall@100 on PredCls (improving the prior best by 2.9 percentage points), 7.1% on SGCls, and 6.8% on SGDet, while obtaining mean Recall@100 of 23.4% on PredCls and 14.9% on SGCls, both highly competitive with recent methods. Using a completely frozen 110M-parameter BERT-base without any fine-tuning of the language model, LM-ANet delivers zero-shot and long-tail performance comparable to or surpassing many recent approaches based on significantly larger vision-language models, while exhibiting strong generalization to Open Images v6.
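The pipeline described in the abstract (visual-to-language projection, cloze-style prediction at a [MASK] position, prototype matching) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the random projection `W_proj`, the toy predicate list, and the stand-in [MASK] hidden state are all assumptions; a real system would obtain the hidden state from a frozen BERT encoding a subject-object-aware prompt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): visual region features
# are projected into the language model's embedding space.
VIS_DIM, LM_DIM = 2048, 768

# Hypothetical learned projection mapping visual tokens into BERT space.
W_proj = rng.normal(scale=0.02, size=(LM_DIM, VIS_DIM))

def project_visual(feat: np.ndarray) -> np.ndarray:
    """Map a visual feature vector into the frozen LM's embedding space."""
    return W_proj @ feat

# Stand-in for the frozen LM's hidden state at the [MASK] position after
# encoding a subject-object-aware prompt such as "person [MASK] horse".
mask_hidden = project_visual(rng.normal(size=VIS_DIM))

# Predicate prototypes: one vector per predicate class (toy vectors here;
# in practice these would come from the language model's representations).
predicates = ["riding", "holding", "near"]
prototypes = rng.normal(size=(len(predicates), LM_DIM))
# Bias one prototype toward the query so the toy example is decisive.
prototypes[0] = mask_hidden + 0.1 * rng.normal(size=LM_DIM)

def classify(hidden: np.ndarray, protos: np.ndarray) -> int:
    """Prototype matching: cosine similarity to each predicate prototype."""
    h = hidden / np.linalg.norm(hidden)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return int(np.argmax(p @ h))

pred = predicates[classify(mask_hidden, prototypes)]
print(pred)
```

Because the classifier is just nearest-prototype matching in the language model's space, unseen (zero-shot) subject-object pairs need no retraining: only the prompt changes, while the frozen LM and prototypes stay fixed.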
Submission Number: 60