Multi-Modal Object Re-identification via Sparse Mixture-of-Experts

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: We present MFRNet, a novel network for multi-modal object re-identification that integrates features from multiple modalities to effectively retrieve specific objects across them. Current methods suffer from two principal limitations: (1) insufficient interaction between pixel-level semantic features across modalities, and (2) difficulty in balancing modality-shared and modality-specific features within a unified architecture. To address these challenges, our network introduces two core components. First, the Feature Fusion Module (FFM) enables fine-grained pixel-level feature generation and flexible cross-modal interaction. Second, the Feature Representation Module (FRM) efficiently extracts and combines modality-specific and modality-shared features, achieving strong discriminative ability with minimal parameter overhead. Extensive experiments on three challenging public datasets (RGBNT201, RGBNT100, and MSVR310) demonstrate the superiority of our approach in terms of both accuracy and efficiency, with improvements of 8.4% in mAP and 6.9% in accuracy on RGBNT201 at negligible additional parameter cost.
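To make the "modality-shared vs. modality-specific" idea concrete, the sketch below shows one common way a sparse Mixture-of-Experts layer can route fused features to a few experts per sample, with some experts notionally dedicated to individual modalities and others shared. This is a minimal illustration under assumed names, shapes, and hyperparameters (`SparseExpertFusion`, `dim=256`, `top_k=2`), not the authors' FRM implementation.

```python
# Illustrative sketch (not the paper's code) of sparse top-k routing over
# modality-specific and modality-shared experts. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseExpertFusion(nn.Module):
    def __init__(self, dim=256, num_specific=3, num_shared=2, top_k=2):
        super().__init__()
        # One expert per modality (e.g. RGB / NIR / TIR) plus a few shared experts.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_specific + num_shared)]
        )
        self.gate = nn.Linear(dim, num_specific + num_shared)  # router over experts
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, dim) fused multi-modal token features.
        logits = self.gate(x)                                  # (B, E)
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)   # sparse routing: keep k experts
        weights = F.softmax(topk_val, dim=-1)                  # (B, k) mixing weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                            # chosen expert id per sample
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])     # weighted expert output
        return out

# Usage: route 8 fused feature vectors through the sparse experts.
feats = torch.randn(8, 256)
fused = SparseExpertFusion()(feats)
print(fused.shape)  # torch.Size([8, 256])
```

Because only `top_k` experts run per sample, the parameter and compute overhead stays small, which is consistent with the efficiency claim above; the actual FFM/FRM design in the paper may differ.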
Lay Summary: We explore how to fully and efficiently use multispectral data to search for specific targets. Recent systems struggle because they don't let the different kinds of pictures interact with each other enough, and they can't decide which details are common to all image types and which belong only to one. Our new method, called MFRNet, tackles both issues. First, it blends the finest visual cues from the different image types so they reinforce one another. Second, it keeps shared clues together while storing unique ones separately, all with almost no extra computing cost. Tests on three public datasets show MFRNet finds the right object far more accurately, up to 8 percentage points better, while staying fast and lightweight.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Computer Vision
Keywords: Object Re-identification, Multi-Modal Learning, Mixture-of-Experts
Submission Number: 2244