Abstract: Highlights
• Our KI2HOI effectively utilizes the VLM's visual–linguistic knowledge and achieves superior zero-shot transferability.
• We develop visual-level and linguistic-level strategies to fuse spatial and semantic information.
• Extensive experiments show SOTA results on HICO-DET and V-COCO in both zero-shot and supervised settings.