Abstract: Identifying novel human-object interaction (HOI) classes from scarce data is a challenging and crucial task in computer vision. Existing methods mainly use coarse global visual information to build class prototypes in meta-learning. Despite their promising results, these methods often fail to capture fine-grained interaction semantics and to learn effectively from data with low inter-class variance, leading to suboptimal performance in distinguishing similar categories. To overcome these issues, we propose a hierarchical relation network for few-shot HOI recognition (FS-HOI). The model integrates multi-level interaction cues, from coarse to fine-grained, to enhance HOI features. It employs a unified graph network to capture intra- and inter-relationships among human parts with contextual information, augmented by language-guided attention for semantic mining within each interactive sub-graph. In contrast to conventional methods that depend on global class prototype comparisons, our approach advances metric learning by integrating contrastive mechanisms, using rich instance pairs as comparative references to address low inter-class variance. Furthermore, a graph relation network leverages prior knowledge of unknown HOIs, embedding task-specific features into contrastive instances. Our method establishes a new state of the art on three few-shot HOI datasets with substantial performance gains, and ablation studies confirm the efficacy of each component.
External IDs: doi:10.1109/tcsvt.2025.3632509