Abstract: The primary challenge of Few-shot Compositional Action Recognition (FSCAR) lies in generalizing to and identifying unseen compositions of motions and objects from only a few labeled videos. However, current approaches typically evaluate FSCAR as a subsidiary task of standard Compositional Action Recognition (CAR), overlooking that generalization is harder in the few-shot setting, where distribution bias is more significant and data are limited. To this end, we thoroughly revisit FSCAR, explicitly acknowledging the crucial role of fine-tuning, and propose a novel trio-tuning-testing framework to improve compositional generalization in few-shot scenarios. Specifically, we devise an effective inner-to-outer baseline, namely Motion-Object Composer (MOC), which hierarchically learns comprehensive representations from concepts to compositions. Furthermore, we propose the Trio-Knowledge Calibration (TKC) strategy, which calibrates inference by transferring the prior visual and language knowledge learned during fine-tuning. Extensive experiments demonstrate that the approach achieves state-of-the-art performance compared with current competitive methods.