LINK: Learning Instance-level Knowledge from Vision-Language Models for Human-Object Interaction Detection
Keywords: Human-object interaction detection, Vision-language model, zero-shot, open-vocabulary
Abstract: Human-Object Interaction (HOI) detection with vision-language models (VLMs) has progressed rapidly, yet a trade-off persists between specialization and generalization.
Two major challenges remain: (1) the sparsity of supervision, which hampers effective transfer of foundation models to HOI tasks, and (2) the absence of a generalizable architecture that can excel in both fully supervised and zero-shot scenarios.
To address these issues, we propose \textbf{LINK}, \textbf{L}earning \textbf{IN}stance-level \textbf{K}nowledge.
First, we introduce an HOI detection framework equipped with a Human-Object Geometrical Encoder and a VLM Linking Decoder.
By decoupling from detector-specific features, our design ensures plug-and-play compatibility with arbitrary object detectors and consistent adaptability across diverse settings.
Building on this foundation, we develop a {Progressive Learning Strategy} under a teacher-student paradigm, which delivers dense supervision over all potential human-object pairs.
By contrasting subtle spatial and semantic differences between positive and negative instances, the model learns robust and transferable HOI representations.
Extensive experiments on SWiG-HOI, HICO-DET, and V-COCO demonstrate state-of-the-art results, showing that our method achieves strong performance in both zero-shot and fully supervised settings while also exhibiting open-vocabulary capability.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15124