Relational Backbones for Visual Representation Learning
Abstract: Most recent vision backbones, including attention-based and graph-based architectures, model relations between image tokens through pairwise interactions. Although this design has proven effective, it represents dependencies shared by multiple regions only indirectly, neglecting crucial higher-order semantic correlations. We study visual representation learning from the perspective of higher-order relational modeling and show that hypergraph message passing is a strictly richer formulation than ordinary graph propagation under nonlinear feature updates. This observation motivates a compact, general-purpose visual backbone in which image tokens are organized into adaptive hyperedges rather than processed only through dyadic connections. The proposed model first discovers semantic centers from patch features and uses them to induce multi-node correlation structures under spatial and topological constraints. It then performs anchor-based hypergraph feature propagation, exchanging information between region tokens and learned hyperedge representations through a lightweight mixed aggregation operator. A differential mixed hyperedge aggregation regularization strategy further reduces instability in the learned structure. Experiments on ImageNet and downstream recognition benchmarks show that the resulting backbone improves the accuracy-efficiency trade-off over Vision Transformers and representative graph-based vision models, especially in low-parameter and low-FLOP regimes.
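To make the two-phase propagation concrete, below is a minimal PyTorch sketch of anchor-based hypergraph token mixing. It is an illustrative reconstruction under assumptions, not the paper's implementation: semantic centers are modeled as free learnable parameters rather than discovered from patch features, the soft incidence matrix is a softmax over token-center similarities, and the class name `HypergraphTokenMixer` and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn


class HypergraphTokenMixer(nn.Module):
    """Sketch of anchor-based hypergraph message passing over image
    tokens. Assumption: hyperedge anchors are learnable parameters;
    the paper derives them from patch features instead."""

    def __init__(self, dim: int, num_hyperedges: int = 16):
        super().__init__()
        # Semantic centers acting as hyperedge anchors (assumed learnable here).
        self.centers = nn.Parameter(torch.randn(num_hyperedges, dim))
        self.to_edge = nn.Linear(dim, dim)   # token -> hyperedge message
        self.to_token = nn.Linear(dim, dim)  # hyperedge -> token message
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim) patch/region tokens.
        # Soft incidence: how strongly each token joins each hyperedge.
        incidence = (x @ self.centers.t()).softmax(dim=-1)     # (B, N, E)

        # Phase 1: aggregate member tokens into hyperedge features
        # (normalized so each hyperedge is a weighted mean of its members).
        edge_feat = torch.einsum("bne,bnd->bed", incidence, self.to_edge(x))
        edge_feat = edge_feat / (incidence.sum(dim=1).unsqueeze(-1) + 1e-6)

        # Phase 2: broadcast hyperedge features back to tokens,
        # mixed with a residual path.
        msg = torch.einsum("bne,bed->bnd", incidence, self.to_token(edge_feat))
        return self.norm(x + msg)


# Usage: mix the 196 tokens of a 14x14 ViT-style feature map.
tokens = torch.randn(2, 196, 384)
mixer = HypergraphTokenMixer(dim=384, num_hyperedges=16)
print(mixer(tokens).shape)  # torch.Size([2, 196, 384])
```

The two einsum contractions realize the token-to-hyperedge and hyperedge-to-token phases of hypergraph message passing; with E hyperedges the cost scales as O(N·E·d) per layer rather than the O(N²·d) of pairwise token attention, which is consistent with the abstract's emphasis on low-FLOP settings.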