iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency

TMLR Paper5594 Authors

10 Aug 2025 (modified: 21 Oct 2025). Under review for TMLR. License: CC BY 4.0

Abstract: The recent emergence of hybrid models has introduced a transformative approach to computer vision, gradually moving beyond conventional convolutional neural networks and vision transformers. However, efficiently combining these two approaches to better capture long-range dependencies in complex images remains a challenge. In this paper, we present iiANET (Inception Inspired Attention Network), an efficient hybrid visual backbone designed to improve the modeling of long-range dependencies in complex visual recognition tasks. The core innovation of iiANET is the iiABlock, a unified building block that runs a modified global multi-head self-attention module (r-MHSA) and convolutional layers in parallel. This design enables iiABlock to capture global context and local details simultaneously, making it effective for extracting rich and diverse features. By efficiently fusing these complementary representations, iiABlock allows iiANET to achieve strong feature interaction while maintaining computational efficiency. Extensive qualitative and quantitative evaluations on standard benchmarks, including COCO and ADE20K, demonstrate improved performance.
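To make the idea of parallel global and local branches concrete, the following is a minimal sketch in plain Python, not the authors' iiABlock. It rests on several assumptions that are not taken from the paper: the input is a 1-D token sequence rather than an image, attention is single-head with identity query/key/value projections (the r-MHSA modification is not specified here), the local branch is a fixed depthwise 1-D convolution, and fusion is plain channel concatenation. The names `global_attention`, `local_conv`, and `parallel_block` are hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens):
    # Assumption: single head, identity Q/K/V projections.
    # Every output token is a weighted average over ALL input tokens.
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens))
                    for c in range(d)])
    return out


def local_conv(tokens, kernel=(0.25, 0.5, 0.25)):
    # Assumption: depthwise 1-D convolution with zero padding.
    # Every output token only sees its immediate neighbours.
    n, d = len(tokens), len(tokens[0])
    half = len(kernel) // 2
    out = []
    for i in range(n):
        row = [0.0] * d
        for offset, w in enumerate(kernel):
            j = i + offset - half
            if 0 <= j < n:
                for c in range(d):
                    row[c] += w * tokens[j][c]
        out.append(row)
    return out


def parallel_block(tokens):
    # Both branches read the same input; outputs are fused by
    # concatenating channels (an assumed fusion strategy).
    g = global_attention(tokens)
    l = local_conv(tokens)
    return [gi + li for gi, li in zip(g, l)]


# A signal at position 0 only: the convolution cannot carry it to
# position 2, while attention can.
impulse = [[1.0], [0.0], [0.0]]
conv_out = local_conv(impulse)
attn_out = global_attention(impulse)
fused = parallel_block(impulse)
```

In this toy input the last token receives nothing from the convolutional branch but a nonzero contribution from the attention branch, which is the complementary behaviour a parallel design is meant to exploit. A real backbone would use learned projections, multiple heads, 2-D feature maps, and a learned fusion step.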

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: We thank the reviewers and editors for their valuable feedback. This revision includes the following updates and improvements:
* Revised figures for improved clarity and visual consistency.
* Updated Section 3.5 and added a schematic table showing the input, operations, and output of each of the three paths.
* Expanded Section 3.6 to include computational efficiency analysis.
* Added Section 3.7 discussing memory efficiency.
* Included the ADE20K dataset in the experimental evaluation (Section 4).
* Added qualitative results on detection and segmentation for COCO and ADE20K (Section 4.2).
* Introduced new experiments on semantic segmentation using ADE20K (Section 4.3.4).
* Added complexity analysis comparing iiABlock and MHSA (Section 4.4.1).
* Added a subsection on the effects of different fusion strategies (Section 4.4.4).
* Proofread and corrected grammatical and typographical errors throughout.

Assigned Action Editor: ~Mathieu_Salzmann1

Submission Number: 5594
