ManufVisSGG: A Vision-Language-Model Approach for Cognitive Scene Graph Generation in Manufacturing Systems

Zhijie Yan, Zuoxu Wang, Shufei Li, Mingrui Li, Xinxin Liang, Jihong Liu

Published: 2024, Last Modified: 12 Nov 2025, Venue: CASE 2024, License: CC BY-SA 4.0

Abstract: To establish a cognitive manufacturing system, scene graph generation (SGG) that lets machines/robots understand objects and their relations under varied scenarios is an essential task. Existing research on SGG primarily focuses on detection and panoptic segmentation approaches, where objects are identified through bounding boxes or panoptic segmentation, followed by the prediction of their pairwise relationships. This process means that the quality of the final scene graph predictions is heavily influenced by the quality of costly annotations. To tackle this issue, we propose the Manufacturing Visual Scene Graph Generation (ManufVisSGG) method, a simple yet powerful approach that leverages the capabilities of Vision-Language Models (VLMs) to generate scene graphs quickly and accurately without any additional object annotations. Furthermore, leveraging the ManufVisSGG method, we have implemented a meticulous annotation procedure to compile a high-quality manufacturing scene graph generation (MSG) dataset, comprising 10,000 images of manufacturing and other industrial scenes. Through comparisons with various scene graph generation methods and benchmarks across two other datasets, we have showcased the superiority of the ManufVisSGG method and underscored the benefits of the MSG dataset over existing datasets.
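To make the notion of a scene graph concrete, the sketch below represents a graph as a set of (subject, predicate, object) triplets and builds it from free-text model output. It is an illustration only, not the ManufVisSGG implementation: the response format, the example objects and relations, and the names `Triplet`, `parse_triplets`, and `build_scene_graph` are all assumptions introduced here.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """One directed relation in a scene graph."""
    subject: str
    predicate: str
    obj: str


def parse_triplets(text: str) -> List[Triplet]:
    """Parse lines shaped like '(subject, predicate, object)'.

    Lines that do not yield exactly three non-empty fields are skipped.
    The line format is an assumption, not a format defined by the paper.
    """
    triplets = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.strip().strip("()").split(",")]
        if len(parts) == 3 and all(parts):
            triplets.append(Triplet(*parts))
    return triplets


def build_scene_graph(triplets: List[Triplet]) -> Dict[str, List[Tuple[str, str]]]:
    """Group relations by subject into an adjacency-list scene graph."""
    graph = defaultdict(list)
    for t in triplets:
        graph[t.subject].append((t.predicate, t.obj))
    return dict(graph)


# Hypothetical text a vision-language model might return for one image.
response = """(robot arm, holding, gear)
(gear, on, workbench)
malformed line
(operator, standing next to, robot arm)"""

triplets = parse_triplets(response)
scene_graph = build_scene_graph(triplets)
```

Here each object becomes a node and each triplet a labeled edge, so downstream code can query relations per object without any bounding-box or segmentation annotations.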

External IDs: dblp:conf/case/YanWLLLL24