Capturing Multi-Region Dependencies in Images via Hyperedge Routing Backbones
Abstract: Self-attention has become the standard mechanism for visual representation, but it predominantly computes pairwise (second-order) similarities between isolated patches. This often overlooks the latent, higher-order semantic correlations among multiple image regions that are essential for accurate scene interpretation. To address this, we present a general-purpose vision backbone built entirely on hypergraph topologies, bypassing the expressive limitations of standard graph neural networks. Our framework operates in two stages: structural induction and feature propagation. We first employ a semantic centroid clustering approach to generate hyperedges that group related visual elements across the image space. We then introduce an adaptive aggregation module that routes features between vertices and hyperedges, using structural regularization to stabilize learning. Comprehensive experiments across both isotropic and hierarchical configurations confirm that explicitly modeling multi-order interactions yields substantial efficiency benefits. Compared to state-of-the-art vision models, our architecture substantially reduces computational overhead and parameter count while achieving highly competitive accuracy on large-scale classification tasks.
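The two-stage process described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: plain k-means stands in for the semantic centroid clustering, an unweighted mean stands in for the learned adaptive aggregation, and structural regularization is omitted. The function names `build_hyperedges` and `hypergraph_propagate`, and the parameters `num_edges` and `top_k`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def build_hyperedges(x, num_edges, top_k, iters=10, seed=0):
    """Stage 1 (structural induction), sketched with plain k-means.

    x: (N, D) patch features. Returns an incidence matrix H of shape
    (N, num_edges) where H[i, e] = 1 if patch i belongs to hyperedge e.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Initialise centroids from randomly chosen patches (an assumption).
    centroids = x[rng.choice(n, size=num_edges, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, E)
        assign = dist.argmin(axis=1)
        for e in range(num_edges):
            members = x[assign == e]
            if len(members) > 0:
                centroids[e] = members.mean(axis=0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    # Each hyperedge collects its top_k nearest patches, so one patch may
    # sit in several hyperedges (or in none).
    nearest = np.argsort(dist, axis=0)[:top_k]  # (top_k, E)
    H = np.zeros((n, num_edges))
    H[nearest, np.arange(num_edges)[None, :]] = 1.0
    return H


def hypergraph_propagate(x, H):
    """Stage 2 (feature propagation): vertices -> hyperedges -> vertices.

    Uses unweighted means in both directions; a learned, adaptive
    weighting would replace these averages.
    """
    edge_deg = H.sum(axis=0)                    # patches per hyperedge
    edge_feat = (H.T @ x) / edge_deg[:, None]   # (E, D) hyperedge features
    vert_deg = np.maximum(H.sum(axis=1), 1.0)   # avoid division by zero
    agg = (H @ edge_feat) / vert_deg[:, None]   # (N, D) routed back
    return x + agg                              # residual connection


# Example: 196 patches (a 14x14 grid) with 64-dimensional features.
x = np.random.default_rng(1).normal(size=(196, 64))
H = build_hyperedges(x, num_edges=8, top_k=32)
y = hypergraph_propagate(x, H)
```

Because every hyperedge connects `top_k` patches at once, a single propagation step mixes information across a whole group of regions rather than across one pair, which is the multi-order interaction the abstract refers to. Patches that fall in no hyperedge pass through unchanged via the residual term.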