Abstract: Model fusion is a key strategy for robust recognition
in unconstrained scenarios, as different models provide
complementary strengths. This is especially important for
whole-body human recognition, where biometric cues such
as face, gait, and body shape vary across samples and
are typically integrated via score-fusion. However, existing score-fusion strategies are usually static, invoking all models for every test sample regardless of sample quality or modality reliability. To overcome this limitation, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each expert
model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns
to adaptively determine the optimal model combination for
each test input. To address score misalignment and embedding heterogeneity across expert models, we introduce Anchor-based
Confidence Top-k (ACT) score-fusion, which anchors on the
most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments
on multiple whole-body biometric benchmarks demonstrate
that FusionAgent significantly outperforms state-of-the-art (SoTA) methods while achieving higher efficiency through fewer model invocations, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. Project page: FusionAgent.
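As a rough illustration of the ACT idea described above, the following is a minimal sketch, not the paper's exact formulation: the min-max normalization, the top-1/top-2 margin as a confidence proxy, and the confidence-weighted sum are all our assumptions.

```python
import numpy as np

def act_fusion(score_matrix: np.ndarray, k: int = 2) -> np.ndarray:
    """Sketch of Anchor-based Confidence Top-k (ACT) score-fusion.

    score_matrix: (num_models, num_gallery) similarity scores, one row
    per invoked expert model. All design choices below are assumptions.
    """
    # Min-max normalize each model's scores to mitigate scale misalignment.
    mins = score_matrix.min(axis=1, keepdims=True)
    maxs = score_matrix.max(axis=1, keepdims=True)
    norm = (score_matrix - mins) / (maxs - mins + 1e-8)

    # Confidence proxy: margin between each model's top-1 and top-2 scores.
    top2 = np.sort(norm, axis=1)[:, -2:]
    confidence = top2[:, 1] - top2[:, 0]

    # Anchor on the most confident model and fuse the k most confident
    # models with confidence-proportional weights.
    order = np.argsort(-confidence)[:k]
    weights = confidence[order] / (confidence[order].sum() + 1e-8)
    return (weights[:, None] * norm[order]).sum(axis=0)

# Usage: fuse scores from three experts (e.g., face, gait, body shape)
# against a gallery of 5 identities.
scores = np.random.rand(3, 5)
fused = act_fusion(scores, k=2)
pred = int(np.argmax(fused))
```

In this sketch the anchor model simply receives the largest fusion weight; the paper's actual confidence measure and integration rule may differ.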