Focusing on what to decode and what to train: Efficient Training with HOI Split Decoders and Split Target Guided DeNoising
Keywords: human-object interaction detection, transformer
TL;DR: A novel one-stage framework with an HOI-specific denoising training strategy for human-object interaction detection.
Abstract: Recent one-stage transformer-based methods achieve notable gains on the Human-Object Interaction (HOI) detection task by leveraging the detection capability of DETR. However, these methods redirect the detection target of the object decoder, and the box target is not explicitly separated from the query embeddings, which leads to long and difficult training. Furthermore, matching predicted HOI instances with the ground truth is harder than in object detection, so simply adopting training strategies from object detection makes training even more difficult. To resolve the ambiguity between human and object detection, we propose a novel one-stage framework (SOV), which consists of a subject decoder, an object decoder, and a well-designed verb decoder. The three split decoders, with explicitly defined box queries, share the prediction burden and accelerate training convergence. To further improve training efficiency, we propose a novel Split Target Guided (STG) DeNoising strategy, which leverages learnable object label embeddings and verb label embeddings to guide the training. In addition, on the prediction side, label-specific information is fed directly into the decoders by initializing the query embeddings from the learnable label embeddings. Extensive experiments show that our method (SOV-STG) achieves 4.68\% higher accuracy than the state-of-the-art method with 3$\times$ fewer training epochs.
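The core architectural ideas in the abstract — three split decoders (subject, object, verb) attending to shared encoder memory, and query embeddings initialized from learnable label embeddings — can be illustrated with a minimal PyTorch sketch. All names, dimensions, and the feature-fusion choice below are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn


def make_decoder(d_model: int) -> nn.TransformerDecoder:
    """A small stack of standard transformer decoder layers (illustrative)."""
    layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=2)


class SOVSketch(nn.Module):
    """Hypothetical sketch of the split-decoder idea: subject and object
    decoders predict boxes separately, a verb decoder predicts interactions,
    and queries are initialized from learnable label embeddings (which STG
    denoising could also reuse as noised training targets)."""

    def __init__(self, d_model=64, num_queries=10,
                 num_obj_classes=5, num_verb_classes=4):
        super().__init__()
        # Learnable label embeddings, as described in the abstract.
        self.obj_label_embed = nn.Embedding(num_obj_classes, d_model)
        self.verb_label_embed = nn.Embedding(num_verb_classes, d_model)
        # Project label-embedding information into the query space.
        self.query_init = nn.Linear(d_model, d_model)
        self.num_queries = num_queries
        # Three split decoders sharing the prediction burden.
        self.subject_decoder = make_decoder(d_model)
        self.object_decoder = make_decoder(d_model)
        self.verb_decoder = make_decoder(d_model)
        # Separate heads: boxes for subject/object, classes for verbs.
        self.sub_box_head = nn.Linear(d_model, 4)
        self.obj_box_head = nn.Linear(d_model, 4)
        self.verb_cls_head = nn.Linear(d_model, num_verb_classes)

    def forward(self, memory):
        # memory: (batch, tokens, d_model), a stand-in for encoder output.
        batch = memory.size(0)
        # Initialize queries from the mean of the learnable label embeddings
        # (one simple way to feed label-specific information to the decoders).
        init = self.query_init(self.obj_label_embed.weight.mean(dim=0))
        queries = init.expand(batch, self.num_queries, -1)
        sub_feat = self.subject_decoder(queries, memory)
        obj_feat = self.object_decoder(queries, memory)
        # Fuse subject and object features before the verb decoder (assumed).
        verb_feat = self.verb_decoder(sub_feat + obj_feat, memory)
        return (self.sub_box_head(sub_feat),
                self.obj_box_head(obj_feat),
                self.verb_cls_head(verb_feat))


model = SOVSketch()
memory = torch.randn(2, 16, 64)
sub_boxes, obj_boxes, verb_logits = model(memory)
print(sub_boxes.shape, obj_boxes.shape, verb_logits.shape)
```

Because each decoder has its own explicitly defined target (subject box, object box, verb class), gradients from the three heads do not compete inside a single decoder, which is the intuition behind the faster convergence claimed above.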
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)