Keywords: Open-Vocabulary Action Recognition, Multi-Modal Pre-training, Multi-Modal Robust Learning
TL;DR: DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition
Abstract: As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To enable open-vocabulary generalization, existing methods formulate vanilla OVAR as evaluating the embedding similarity between visual samples and text descriptions. However, one crucial issue is completely ignored: the text descriptions given by users may be noisy, e.g., contain misspellings and typos, which limits real-world practicality. To fill this research gap, this paper analyzes the noise rate/type in text descriptions through comprehensive statistics of manual misspellings; then reveals the poor robustness of existing methods; and finally motivates the study of a more practical task: noisy OVAR. We further propose a novel DENOISER framework comprising two parts: generation and discrimination. Concretely, the generative part denoises noisy text descriptions via a decoding process, i.e., it proposes text candidates and then uses inter-modal and intra-modal information to vote for the best one. In the discriminative part, we use vanilla OVAR models to assign visual samples to text descriptions, injecting more semantics. For optimization, we alternately iterate between the generative and discriminative parts for progressive refinement. The denoised text descriptions help OVAR models classify visual samples more accurately; in return, the assigned visual samples help with better denoising. We carry out extensive experiments to demonstrate our superior robustness, and thorough ablations to dissect the effectiveness of each component.
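The alternating generative-discriminative loop described in the abstract can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the lexicon, the stand-in text/visual embeddings, and the candidate-proposal and voting heuristics (string similarity for intra-modal evidence, cosine similarity for inter-modal evidence) are all hypothetical placeholders.

```python
import difflib
import math

# Toy stand-ins for a vision-language model's text encoder
# (all names and vectors here are illustrative, not from the paper).
LEXICON = ["running", "jumping", "swimming"]   # clean action names
TEXT_EMB = {
    "running":  [1.0, 0.0, 0.0],
    "jumping":  [0.0, 1.0, 0.0],
    "swimming": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def propose_candidates(noisy, k=3):
    """Generative step: decode plausible corrections for a noisy description."""
    return difflib.get_close_matches(noisy, LEXICON, n=k, cutoff=0.0)

def vote(noisy, candidates, visual_emb):
    """Vote for the best candidate using intra-modal (string similarity)
    and inter-modal (visual-text similarity) evidence."""
    def score(c):
        intra = difflib.SequenceMatcher(None, noisy, c).ratio()
        inter = cosine(TEXT_EMB[c], visual_emb)
        return intra + inter
    return max(candidates, key=score)

def assign(visual_embs, class_names):
    """Discriminative step: assign each visual sample to the closest
    text description, as a vanilla OVAR model would."""
    return [max(class_names, key=lambda c: cosine(TEXT_EMB[c], v))
            for v in visual_embs]

# One generative-discriminative iteration on a misspelled class name.
noisy_label = "runing"                  # user typo for "running"
videos = [[0.9, 0.1, 0.0]]              # pretend visual features
denoised = vote(noisy_label, propose_candidates(noisy_label), videos[0])
predictions = assign(videos, [denoised, "jumping", "swimming"])
```

In a full system, the denoised description would feed back into the next assignment round, and the new assignments would in turn refine the next denoising round, which is the progressive refinement the abstract describes.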
Supplementary Material: zip
Primary Area: Machine vision
Submission Number: 8897