Abstract: Scene Graph Generation (SGG) is the task of building, from an image, a directed graph whose edges represent predicted <subject - predicate - object> triplets. Most SGG models struggle to identify the important and descriptive relations in an image, flooding the graph with uninformative triplets such as <window - on - building>. This is not a training problem but rather a consequence of the lack of saliency in fully supervised SGG datasets. Hence, observing that annotators describing an image naturally omit background relations and thus encode image saliency, we (i) introduce a generalized method for training SGG models with weak supervision from image captions, (ii) introduce two variations of the Recall@N metric that quantify the saliency of SGG models, and (iii) perform quantitative and qualitative comparisons with related literature on VG200, where we achieve up to 35% improvement over our re-implementation of the state of the art.
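For context, below is a minimal sketch of the standard Recall@N metric that the proposed variations build on; the abstract does not specify the variations themselves, and the function name, simplified exact-match criterion (no bounding-box localization check), and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def recall_at_n(pred_triplets, pred_scores, gt_triplets, n=50):
    """Fraction of ground-truth triplets recovered among the model's
    top-n scoring predictions (simplified: exact triplet match only).

    pred_triplets: list of (subject, predicate, object) tuples
    pred_scores:   one confidence score per predicted triplet
    gt_triplets:   list of ground-truth (subject, predicate, object) tuples
    """
    if not gt_triplets:
        return 0.0
    # Rank predictions by confidence and keep the top n.
    top = {pred_triplets[i] for i in np.argsort(pred_scores)[::-1][:n]}
    return sum(t in top for t in gt_triplets) / len(gt_triplets)

# Toy example: a background relation outscoring the salient one hurts Recall@1.
preds  = [("window", "on", "building"), ("man", "riding", "horse")]
scores = [0.9, 0.8]
gt     = [("man", "riding", "horse")]
print(recall_at_n(preds, scores, gt, n=1))  # 0.0
```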