Leveraging MLLMs for Zero-Shot Action Recognition: Concise, Discriminative and Anti-Hallucination Prompting
Keywords: Action recognition, Multi-modal large language model, Training-free learning
Abstract: Leveraging the capabilities of large language models (LLMs), multi-modal LLMs (MLLMs) show great promise for zero-shot action recognition (ZSAR). However, current MLLM-based approaches often struggle to pick the correct action from a large label set, largely due to issues such as lengthy, vague prompts and hallucinated outputs. In this paper, we introduce CDantiHalP (concise, discriminative and anti-hallucination prompting), a novel LLM-driven approach to enhance MLLM performance in ZSAR. CDantiHalP is a training-free, post-refinement method designed to improve recognition accuracy for any baseline model. It consists of two core components: (1) concise, discriminative prompting to effectively distinguish confused action pairs, and (2) logic-contradictory hallucination detection (LogCHalD) to identify and mitigate hallucinations. Rather than relying on MLLMs to select from a broad set of labels, CDantiHalP leverages their strength in pairwise comparison of specific concepts. The use of concise, discriminative prompts highlights the distinguishing features between confused actions, guiding MLLMs to focus on critical differences while remaining alert to potential hallucinations. The LogCHalD framework further enhances response reliability by using a logic-contradictory strategy to detect hallucinated responses for each confused action pair. During inference, CDantiHalP assesses hallucination risk and emphasizes consistency across MLLM outputs to mitigate the impact of hallucinations. Extensive experiments demonstrate that CDantiHalP achieves state-of-the-art performance on various ZSAR datasets.
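The pairwise post-refinement idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ask` callback stands in for a real MLLM pairwise query, and the candidate list, vote count, and consistency threshold are illustrative assumptions.

```python
# Minimal sketch of pairwise post-refinement with a consistency check.
# `ask(video, a, b, seed)` is a HYPOTHETICAL stand-in for prompting an MLLM
# with a concise, discriminative prompt contrasting actions `a` and `b`.
from collections import Counter


def pairwise_refine(video, candidates, ask, n_queries=5, tau=0.6):
    """Refine a baseline model's top-k candidates by pairwise comparison.

    Each challenger is compared against the current winner over several
    queries; the majority answer is accepted only if its vote share reaches
    the consistency threshold `tau` (low consistency is treated as a sign
    of hallucination, so the current winner is kept).
    """
    winner = candidates[0]  # baseline's top-1 prediction
    for challenger in candidates[1:]:
        votes = Counter(ask(video, winner, challenger, s) for s in range(n_queries))
        top, count = votes.most_common(1)[0]
        if count / n_queries >= tau:  # answers are consistent enough to trust
            winner = top
        # otherwise: inconsistent (possibly hallucinated) -> keep prior winner
    return winner
```

For example, with a stub `ask` that consistently prefers one action, the refinement adopts it; with an `ask` that flip-flops between answers, the consistency check rejects the comparison and the baseline prediction survives.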
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23533