AnomalyLMM: Bridging Generative Knowledge and Discriminative Retrieval for Text-based Person Anomaly Search
Keywords: Text-based person anomaly search, Fine-grained cross-modal alignment, Long-tail anomaly recognition, Large multi-modal model
Abstract: Text-based person anomaly search requires fine-grained alignment between language queries and subtle visual cues, a challenge amplified by the long-tailed nature of anomalous behaviors. While Large Multimodal Models (LMMs) offer powerful visual understanding, their generative pre-training is fundamentally misaligned with the discriminative needs of this retrieval task, and adapting them without costly fine-tuning remains a significant hurdle. We introduce \textbf{AnomalyLMM}, a training-free, coarse-to-fine framework that effectively repurposes generative LMMs for zero-shot anomaly retrieval. Our method first uses a general retrieval model to produce an initial ranking. To refine this list, we introduce a novel cloze-based re-ranking mechanism with three steps. The first step, \textbf{Cloze Generation}, converts the text query into a ``fill-in-the-blank'' prompt. Next, \textbf{Cloze Completion} compels the LMM to focus on specific visual regions and generate a description of the potential anomaly. The final step, \textbf{Comparison \& Re-ranking}, measures the semantic alignment between the LMM's generated completion and the original query; this alignment score serves as a powerful re-ranking signal.
Experiments on the PAB dataset show that AnomalyLMM surpasses the competitive baseline by $+0.96\%$ in Recall@1 accuracy. Crucially, our method provides highly interpretable visual-textual alignments without any task-specific training, a vital feature for real-world deployment. The code will be made publicly available.
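To make the three-step pipeline concrete, below is a minimal sketch of how such a coarse-to-fine, cloze-based re-ranking could be wired together. It is based only on the abstract, not the authors' code: all component names (coarse_retrieve, generate_cloze, lmm_complete, embed_text) are hypothetical placeholders injected as callables, and the score-fusion weight alpha is an assumption, since the abstract does not say how the coarse and fine scores are combined.

```python
"""Hypothetical sketch of AnomalyLMM-style cloze-based re-ranking.
All component functions are placeholders, not the authors' implementation."""
from typing import Callable, List, Tuple
import math


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two text embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)


def cloze_rerank(
    query: str,
    gallery: List[str],  # image ids or paths
    coarse_retrieve: Callable[[str, List[str], int], List[Tuple[str, float]]],
    generate_cloze: Callable[[str], str],       # query -> fill-in-the-blank prompt
    lmm_complete: Callable[[str, str], str],    # (image, cloze prompt) -> completion
    embed_text: Callable[[str], List[float]],   # text -> embedding vector
    top_k: int = 20,
    alpha: float = 0.5,  # assumed fusion weight between coarse and fine scores
) -> List[Tuple[str, float]]:
    """Coarse-to-fine retrieval: coarse ranking, then cloze-based refinement."""
    # Coarse stage: a general retrieval model produces the initial top-k list.
    candidates = coarse_retrieve(query, gallery, top_k)

    # Step 1 -- Cloze Generation: turn the query into a fill-in-the-blank prompt.
    cloze_prompt = generate_cloze(query)

    # Steps 2-3 -- Cloze Completion and Comparison, per candidate image.
    q_emb = embed_text(query)
    rescored = []
    for image, coarse_score in candidates:
        completion = lmm_complete(image, cloze_prompt)  # LMM describes the anomaly
        fine_score = cosine(embed_text(completion), q_emb)
        rescored.append((image, alpha * coarse_score + (1 - alpha) * fine_score))

    # Re-rank by the fused score.
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

Passing the retrieval model, LMM, and text encoder in as callables keeps the sketch training-free and model-agnostic, mirroring the paper's claim that no task-specific fine-tuning is required; in practice the linear fusion of coarse and fine scores is only one plausible choice.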
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2154