InfoCLIP++: A Multimodal Learning Framework with Multi-Granular Information-Theoretic Alignment and Adaptive Fusion

17 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Learning, CLIP, HGR Correlation, Optimal Transport, Efficient Attention Modeling
TL;DR: InfoCLIP++ improves on CLIP through multi-granular HGR-OT alignment, dynamic token routing, and efficient hardware-aware approximation, achieving higher zero-shot accuracy, faster inference, and stronger noise robustness.
Abstract: Multimodal foundation models such as CLIP have significantly advanced vision-language understanding, yet they face persistent challenges including coarse semantic alignment, high computational overhead, and sensitivity to noisy inputs. This paper introduces \textit{InfoCLIP++}, an integrated framework that addresses these limitations through three synergistic components: (i) multi-granular alignment, using constrained optimal transport for pixel-level details and Random-Feature HGR correlation for patch-level and global semantics; (ii) differentiable adaptive routing for token and modality pruning via entropy-gradient criteria; and (iii) hardware-aware optimization with quantized random feature projections for efficient deployment. The model is trained end-to-end with a composite objective combining alignment losses, contrastive learning, and sparsity regularization. Extensive evaluations demonstrate consistent and significant improvements: 84.3\% zero-shot accuracy on ImageNet-1K (an 8.1\% gain over CLIP), 74.5\% R@1 on COCO cross-modal retrieval (a 16.1\% improvement), and a Noise Robustness Score of 0.90 on ImageNet-C. Computationally, InfoCLIP++ reduces FLOPs by 87\% and achieves a 6.8$\times$ speedup on FPGA platforms, establishing it as an efficient and robust foundation for resource-constrained multimodal intelligence.
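
As a concrete illustration of the Random-Feature HGR correlation term mentioned in the abstract, the following is a minimal sketch, not the authors' implementation: it assumes a Soft-HGR-style surrogate computed on random Fourier feature projections of image and text embeddings. All function names, shapes, and the choice of surrogate are illustrative assumptions.

```python
# Hypothetical sketch of a Random-Feature HGR correlation alignment term
# (a Soft-HGR-style surrogate on random Fourier features). This is NOT the
# authors' code; all names, shapes, and hyperparameters are assumptions.
import math
import torch


def random_fourier_features(x, W, b):
    """Map (batch, dim) embeddings to (batch, k) random Fourier features."""
    return math.sqrt(2.0 / W.shape[1]) * torch.cos(x @ W + b)


def soft_hgr_loss(f, g, eps=1e-5):
    """Negative Soft-HGR objective between feature sets f, g of shape (batch, k)."""
    f = f - f.mean(dim=0, keepdim=True)
    g = g - g.mean(dim=0, keepdim=True)
    n = f.shape[0]
    cov_fg = f.T @ g / (n - 1)                               # cross-covariance
    cov_f = f.T @ f / (n - 1) + eps * torch.eye(f.shape[1])  # regularized covariances
    cov_g = g.T @ g / (n - 1) + eps * torch.eye(g.shape[1])
    # Maximize trace(cov_fg) - 0.5 * trace(cov_f @ cov_g); return the negative to minimize.
    return -(torch.trace(cov_fg) - 0.5 * torch.trace(cov_f @ cov_g))


# Toy usage: a shared random projection applied to image and text embeddings.
dim, k, batch = 512, 256, 32
W = torch.randn(dim, k)
b = 2 * math.pi * torch.rand(k)
img_emb = torch.randn(batch, dim, requires_grad=True)
txt_emb = torch.randn(batch, dim, requires_grad=True)
loss = soft_hgr_loss(random_fourier_features(img_emb, W, b),
                     random_fourier_features(txt_emb, W, b))
loss.backward()
```

In the full framework, such a term would presumably be one component of the composite objective described in the abstract, combined with the contrastive, optimal-transport alignment, and sparsity-regularization losses.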
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9538