Abstract: When engaged in complex visual cognition, humans tend to rely on experience and reach decisions only after repeated deliberation. Inspired by this, we endow action recognition with a similar capability and propose a new Think Twice framework: think twice about visually similar, easily confused categories, thereby improving performance. First, a large language model clusters all categories of a given dataset into disjoint cliques based on visual similarity, and a textual prompt is generated for each clique. Second, a first inference pass produces pseudo-labels, and each sample is assigned the prompt corresponding to its pseudo-label's clique. Third, a prompt learning method is integrated so that the framework simulates human-like iterative thinking and yields a final decision. Our proposed framework requires minimal trainable parameters while achieving state-of-the-art parameter-efficient fine-tuning (PEFT) performance across four datasets. Our code is available at https://github.com/KangRuan6/ThinkTwice.
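The two-pass pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the clique grouping, prompts, and the `first_pass`/`second_pass` classifiers are all hypothetical stand-ins for the LLM-generated cliques and the prompt-conditioned model.

```python
# Hypothetical sketch of the Think Twice two-pass inference flow.
# Step 1 (offline): categories are grouped into visually similar cliques,
# each paired with a generated textual prompt. Hard-coded here for clarity.
CLIQUES = {
    "water_sports": (["surfing", "kayaking", "rowing"],
                     "a video of a person doing a water sport"),
    "ball_games": (["basketball", "volleyball", "soccer"],
                   "a video of a person playing a ball game"),
}

def clique_prompt(label):
    """Look up the clique prompt for a (pseudo-)label."""
    for members, prompt in CLIQUES.values():
        if label in members:
            return prompt
    return ""

def first_pass(sample):
    """Stand-in for the first inference: returns a pseudo-label."""
    return sample["coarse_guess"]  # pretend the model predicted this

def second_pass(sample, prompt):
    """Stand-in for the prompt-conditioned second inference ('think twice').

    In the real framework a learned prompt steers the model toward
    discriminating within the confusable clique; here we simply return
    the sample's fine-grained label to illustrate the data flow.
    """
    return sample["fine_label"]

def think_twice(sample):
    pseudo = first_pass(sample)          # step 2: pseudo-label
    prompt = clique_prompt(pseudo)       # assign the clique's prompt
    return second_pass(sample, prompt)   # step 3: refined decision

sample = {"coarse_guess": "kayaking", "fine_label": "rowing"}
print(think_twice(sample))
```

The key structural point is that the second pass conditions on a clique-level prompt chosen from the first pass's pseudo-label, so easily confused categories within one clique get a second, more focused look.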
External IDs: dblp:conf/icmcs/RuanXHYSZ25