Multi-Body Visual Reasoning

Yuxuan Yuan

Published: 01 Feb 2026, Last Modified: 05 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Conventional vision architectures encode relational structure almost exclusively through pairwise attentional or convolutional operations, which restricts representational capacity to second-order interactions among image regions. This work establishes that such pairwise formulations are fundamentally insufficient for capturing the multi-body semantic dependencies that underlie complex visual scenes. We give a constructive proof that there exist families of visual hypergraph convolution operators whose expressive power cannot be recovered by any finite composition of standard graph convolution layers, even when the latter are granted unrestricted width and nonlinearity, which establishes a strict hierarchy of relational expressiveness. Building on this insight, we introduce a hypergraph-centric vision backbone organized around two complementary stages: structure discovery and multi-order information flow. In the structure discovery stage, we develop an anchor-based hypergraph construction scheme that leverages learned semantic anchors and spatial-topological priors to adaptively organize image patches into flexible higher-order groupings. The information flow stage employs a novel heterogeneous differential mixed aggregation mechanism that separately models intra-hyperedge consensus and inter-hyperedge competition, coupled with a stochastic regularization strategy on the learned hypergraph topology that prevents structural overfitting. We evaluate both columnar and hierarchical instantiations of our framework. On ImageNet classification, our approach reduces parameter count by up to 76% and computational cost by up to 93% relative to standard Vision Transformers, while improving top-1 accuracy by as much as 5.1 percentage points, establishing a substantially more favorable efficiency-accuracy frontier than existing graph-based and attention-based backbones.
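
To make the two stages concrete, the following is a minimal illustrative sketch, not the paper's method. It assumes fixed (not learned) anchors, Euclidean distance as the only grouping criterion, and plain mean aggregation in place of the heterogeneous differential mixed aggregation described above. The names `build_hyperedges`, `hypergraph_step`, and the parameter `k` are hypothetical and chosen for illustration only.

```python
import math

def build_hyperedges(patches, anchors, k=1):
    """Structure discovery (assumed form): each anchor defines one hyperedge,
    and every patch joins the hyperedges of its k nearest anchors."""
    edges = [[] for _ in anchors]
    for i, p in enumerate(patches):
        dists = [math.dist(p, a) for a in anchors]
        nearest = sorted(range(len(anchors)), key=lambda j: dists[j])[:k]
        for j in nearest:
            edges[j].append(i)
    return edges

def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def hypergraph_step(patches, edges):
    """Information flow (assumed form): two-stage mean aggregation.
    Stage 1 pools member patches into one feature per hyperedge.
    Stage 2 sends each patch the mean of its incident hyperedge features."""
    edge_feats = [mean([patches[i] for i in e]) if e else None for e in edges]
    updated = []
    for i, p in enumerate(patches):
        incident = [edge_feats[j] for j, e in enumerate(edges) if i in e]
        # A patch that belongs to no hyperedge keeps its own feature.
        updated.append(mean(incident) if incident else list(p))
    return updated

# Four toy 2-D "patch features" and two anchors (all values are made up).
patches = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
anchors = [[0.0, 1.0], [4.0, 1.0]]

edges = build_hyperedges(patches, anchors, k=1)
updated = hypergraph_step(patches, edges)
print(edges)    # [[0, 1], [2, 3]]
print(updated)  # [[0.0, 1.0], [0.0, 1.0], [4.0, 1.0], [4.0, 1.0]]
```

Unlike a pairwise graph layer, each hyperedge here can group any number of patches at once, so a single update mixes information across the whole group. Setting `k` above 1 lets a patch belong to several hyperedges, which is one simple way to obtain overlapping higher-order groupings.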