Structural Expressiveness in Vision

Yuxuan Yuan

Published: 27 Nov 2025, Last Modified: 05 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Most contemporary visual recognition models, whether convolutional, attentional, or graph-based, operate under an implicit assumption that visual semantics can be adequately captured through pairwise feature exchanges between spatial locations. This paper challenges that assumption at the foundational level. We formally characterize the representational limitations of second-order message-passing schemes by proving that nonlinear hypergraph neural operators on a shared vertex set cannot, in general, be reduced to equivalent graph message-passing networks unless the hypergraph structure meets extremely narrow combinatorial criteria. This separation result implies that genuine higher-order relational modeling provides an expressivity gain that cannot be recovered by simply deepening or widening pairwise architectures. Motivated by this theoretical finding, we design a general-purpose visual backbone that explicitly models multi-order correlations through learned hypergraph structures. Our framework comprises two coupled phases: an inductive phase, in which anchor-guided hyperedge construction algorithms dynamically assemble semantic neighborhoods at multiple granularities, and a propagative phase, in which a differential multi-stream aggregation scheme updates vertex representations by simultaneously accounting for hyperedge-level coherence and cross-hyperedge interactions. To stabilize the learned hypergraph topology during training, we introduce a structured dropout scheme that randomly suppresses hyperedges according to their learned confidence scores. We instantiate the framework in both isotropic and pyramid configurations. Empirical evaluation on ImageNet demonstrates that our method consistently outperforms strong Transformer and graph-network baselines, delivering up to a 5.1% absolute accuracy gain alongside 76% fewer parameters and 93% lower FLOPs than the corresponding Vision Transformer, establishing a new efficiency-accuracy trade-off frontier for relational vision architectures.
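
For readers unfamiliar with hypergraph propagation, the following is a minimal NumPy sketch of a generic vertex-to-hyperedge-to-vertex update combined with confidence-dependent hyperedge dropout. It is not the paper's implementation: the abstract does not specify the differential multi-stream aggregation scheme or the exact dropout rule, so the mean aggregation, the function names `hypergraph_propagate` and `drop_hyperedges`, the `max_drop` parameter, and the choice to drop low-confidence hyperedges more often are all illustrative assumptions.

```python
import numpy as np


def hypergraph_propagate(X, H, W):
    """One generic vertex -> hyperedge -> vertex round (mean aggregation).

    X: (n, d) vertex features, H: (n, m) binary incidence matrix,
    W: (d, d_out) weight matrix. Assumed form, not the paper's operator.
    """
    edge_deg = np.maximum(H.sum(axis=0), 1.0)  # vertices per hyperedge
    vert_deg = np.maximum(H.sum(axis=1), 1.0)  # hyperedges per vertex
    E = (H.T @ X) / edge_deg[:, None]          # hyperedge features: mean of members
    Z = (H @ E) / vert_deg[:, None]            # vertex update: mean of incident hyperedges
    return np.maximum(Z @ W, 0.0)              # ReLU nonlinearity


def drop_hyperedges(H, confidence, max_drop, rng):
    """Suppress hyperedge j with probability max_drop * (1 - confidence[j]).

    Assumption: lower confidence means a higher chance of being dropped.
    """
    p_drop = max_drop * (1.0 - confidence)
    keep = rng.random(H.shape[1]) >= p_drop    # one Bernoulli draw per hyperedge
    return H * keep[None, :]                   # zero the columns of dropped hyperedges


# Toy example: 4 vertices, 2 hyperedges ({0, 1, 2} and {2, 3}).
X = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [0.0, 4.0]])
H = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
W = np.eye(2)

rng = np.random.default_rng(0)
confidence = np.array([0.9, 0.4])              # hypothetical learned scores
H_train = drop_hyperedges(H, confidence, max_drop=0.5, rng=rng)
out = hypergraph_propagate(X, H_train, W)      # (4, 2) updated vertex features
```

Note that a hyperedge covering three or more vertices aggregates them jointly in a single step. This is the kind of higher-order grouping that the abstract argues cannot, in general, be reduced to pairwise message passing.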