RelationLMM: Large Multimodal Model as Open and Versatile Visual Relationship Generalist

Published: 01 Jan 2025 · Last Modified: 15 May 2025 · IEEE Trans. Pattern Anal. Mach. Intell. 2025 · CC BY-SA 4.0
Abstract: Visual relationships are crucial for visual perception and reasoning, and cover tasks such as Scene Graph Generation, Human-Object Interaction detection, and object affordance. Despite significant efforts, the field still suffers from the following limitations: specialist models built for a single task without considering similar ones, strict and complex task formulations with limited flexibility, and underexploited reasoning with language and knowledge. To address these limitations, we seek to build a new framework, one model for all tasks, on top of Large Multimodal Models (LMMs). LMMs offer the potential for unifying tasks, flexible formulations, and reasoning with language. However, they fail to handle visual relationship tasks well. We find that the obstacles include conflicts between different tasks and insufficient instance-level information. We address these problems by reforming the data fed to LMMs rather than their architectures, leveraging their strong language-in, language-out capability. We propose to disassemble tasks into simple and common sub-tasks, verbally estimate instance confidence, and augment instance diversity, all without additional modules. These strategies allow us to build a visual relationship generalist, RelationLMM, with a simple architecture. Extensive experiments demonstrate that RelationLMM is strong, generalizable, and flexible across different tasks, with one model and one set of weights.
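To make the data-reformulation idea in the abstract concrete, the sketch below shows one way a single annotated relationship could be decomposed into simple instruction/response sub-tasks with verbalized instance confidence for LMM tuning. The data classes, confidence thresholds, and prompt templates are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:
    label: str          # object category, e.g. "person"
    box: List[float]    # [x1, y1, x2, y2] in image coordinates
    score: float        # detector confidence in [0, 1]

@dataclass
class Relation:
    subject: Instance
    obj: Instance
    predicate: str      # e.g. "riding"

def verbalize_confidence(score: float) -> str:
    """Map a numeric detection score to a coarse verbal confidence level
    (thresholds here are arbitrary, for illustration only)."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

def relation_to_subtasks(rel: Relation) -> List[dict]:
    """Decompose one annotated relationship into simple instruction/response
    sub-tasks that a language-in, language-out LMM could be tuned on."""
    subj, obj = rel.subject, rel.obj
    subj_desc = f"{subj.label} at {subj.box} (confidence: {verbalize_confidence(subj.score)})"
    obj_desc = f"{obj.label} at {obj.box} (confidence: {verbalize_confidence(obj.score)})"
    return [
        {   # sub-task 1: locate the instances involved in the relationship
            "instruction": f"Locate the {subj.label} and the {obj.label} in the image.",
            "response": f"{subj_desc}; {obj_desc}",
        },
        {   # sub-task 2: classify the predicate between the two instances
            "instruction": f"What is the relationship between the {subj_desc} and the {obj_desc}?",
            "response": rel.predicate,
        },
    ]

if __name__ == "__main__":
    person = Instance("person", [12.0, 30.5, 120.0, 340.0], 0.92)
    bike = Instance("bicycle", [40.0, 150.0, 200.0, 360.0], 0.61)
    for sample in relation_to_subtasks(Relation(person, bike, "riding")):
        print(sample)
```

Framing each relationship as text-only sub-tasks of this kind is one way a single LMM could serve Scene Graph Generation, Human-Object Interaction, and affordance data without task-specific heads.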