Window Attention is Bugged: How not to Interpolate Position Embeddings

Daniel Bolya; Chaitanya Ryali; Judy Hoffman; Christoph Feichtenhofer

Published: 16 Jan 2024, Last Modified: 13 Mar 2024

Venue: ICLR 2024 poster

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Bug fix, window attention, position embeddings, high resolution finetuning, image classification, video classification, object detection, instance segmentation

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: Interpolating absolute position embeddings for models with window attention (e.g., Hiera, ViTDet) is wrong. We fix it, obtaining significant gains in accuracy / efficiency.

Abstract: Window attention, position embeddings, and high resolution finetuning are core concepts in the modern transformer era of computer vision. However, we find that naively combining these near-ubiquitous components can have a detrimental effect on performance. The issue is simple: interpolating position embeddings while using window attention is wrong. We study two state-of-the-art methods that have these three components, namely Hiera and ViTDet, and find that both do indeed suffer from this bug. To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both the speed and the performance of the model in ViTDet. Finally, we combine the two to obtain HieraDet, which achieves 61.7 box mAP on COCO, making it state-of-the-art for models that only use ImageNet-1k pretraining. This all stems from what is essentially a 3-line bug fix, which we name "absolute win".

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: representation learning for computer vision, audio, language, and other modalities

Submission Number: 1186
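
The bug described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' code: the function names (`bilinear_resize`, `window_aware_resize`), the 4x4 window size, the grid sizes, and the split into a window-sized embedding plus a global embedding are all assumptions made here, as one plausible reading of "absolute window position embedding". The sketch assumes the learned embedding has a component that repeats once per attention window. Interpolating the whole grid stretches that component so it no longer lines up with the windows, whereas tiling it keeps every window identical.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinearly resize an (H, W, C) array to (out_h, out_w, C), corner-aligned."""
    h, w, _ = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical blend weights
    wx = (xs - x0)[None, :, None]   # horizontal blend weights
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def window_aware_resize(window_embed, global_embed, out_h, out_w):
    """Hypothetical fix: tile the window-sized part, interpolate only the global part."""
    wh, ww, _ = window_embed.shape
    assert out_h % wh == 0 and out_w % ww == 0, "grid must be a multiple of the window"
    tiled = np.tile(window_embed, (out_h // wh, out_w // ww, 1))
    return tiled + bilinear_resize(global_embed, out_h, out_w)

# Illustrative sizes (assumed): 4x4 windows, 8x8 pretraining grid, 16x16 finetuning grid.
rng = np.random.default_rng(0)
channels = 3
window_embed = rng.normal(size=(4, 4, channels))
global_embed = np.zeros((8, 8, channels))  # zero here so the window part is easy to inspect

# Embedding as seen at pretraining resolution: the window pattern repeated 2x2 times.
pretrain_pos = np.tile(window_embed, (2, 2, 1)) + global_embed

# Naive approach: interpolate everything, which stretches the per-window pattern.
naive = bilinear_resize(pretrain_pos, 16, 16)

# Window-aware approach: every 4x4 window still sees the same pattern.
fixed = window_aware_resize(window_embed, global_embed, 16, 16)

print("naive keeps window pattern:", np.allclose(naive[4:8, 8:12], window_embed))
print("fixed keeps window pattern:", np.allclose(fixed[4:8, 8:12], window_embed))
```

In this toy setup the global embedding is zero, so the only thing being compared is whether each attention window receives the position pattern it saw during pretraining. With a nonzero global embedding, the same function would add a smoothly interpolated term on top of the tiled one.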
