Keywords: Decoupled Energy‑based Models; Graph Structure Refinement; Vision Graph Neural Networks
Abstract: Vision Graph Neural Networks (ViG) treat an image as a set of visual patches for graph representation learning and yield promising results across various computer vision tasks. However, most existing work focuses primarily on static graph construction, ignoring the performance gains and noise reduction that dynamic structure refinement can provide. Meanwhile, generative models such as Energy-based Models (EBMs) are generally unsuitable for discriminative tasks and struggle with large-scale images. Our goal is to introduce a unified generative-discriminative paradigm that dynamically models relationships between visual patches and produces higher-quality representations for downstream tasks. Specifically, we propose **D**ecoupled **E**nergy **L**earning (DEL), which defines a joint distribution of sample pairs to approximate the target distribution. It decouples EBMs into energy matching and contrastive learning, combined as a global loss function that pulls similar pairs closer and pushes dissimilar pairs further apart in the representation space. For implementation, we develop an end-to-end framework, termed **D**ecoupled **EN**ergy learning guided **S**tructure r**E**finement for improving **ViG** (DenseViG). Structure refinement is deployed within ViG architectures in a plug-and-play manner, dynamically adding or pruning edges based on similarity metrics with a relaxation strategy. Theoretical analyses demonstrate the effectiveness of DenseViG in processing large datasets through graph operations. Empirical evaluations confirm that it outperforms state-of-the-art methods on three major benchmarks, achieving 84.3\% Top-1 accuracy on ImageNet-1K, 46.4\% mAP on MS COCO, and 50.9\% mIoU on ADE20K.
Primary Area: learning on graphs and other geometries & topologies
Submission Number: 2513