Abstract: Current Siamese and Transformer trackers commonly use separate subtask branches, such as regression and classification, to predict object states. Despite their demonstrated success, these subtask branches can introduce location and scale offsets due to discrepancies and misalignment between the respective predictions. To address this, we propose a novel generative tracker, MIMTrack, which formulates tracking as a Masked Image Modeling (MIM) process combined with in-context learning (ICL). MIMTrack begins by building a visual prompt image consisting of a template, a search area, and the two target images associated with them. Each target image encodes the corresponding bounding box in the same unified RGB image space as the other tracking images, so all state predictions are naturally aligned through pixel generation of the search target image. Accordingly, we perform an MIM process within the visual prompt to reconstruct the masked search target image using context from the other parts. MIM with ICL exploits implicit cross-relations between the template and the search area, and the single-stream generative framework reduces offsets in state estimation. Furthermore, a latent memory module is introduced as a plug-in to enhance pixel generation by leveraging varied target appearances over time. Strong performance on leading benchmark datasets highlights the simplicity and effectiveness of our MIMTrack framework.
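To make the visual-prompt construction and masked reconstruction concrete, the following is a minimal sketch of the idea the abstract describes. Everything here is an assumption for illustration: the 2x2 grid layout, the way a bounding box is rendered as an RGB target image, and the stand-in convolutional generator (the paper's actual model would be a single-stream Transformer) are all hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of MIM-based tracking with a visual prompt.
# Assumed layout (2x2 grid): [template | template target]
#                            [search   | masked search target]
import torch
import torch.nn as nn


def box_to_target_image(image, box):
    """Render a bounding box as an RGB 'target image' (assumed encoding:
    zero background, box region copied from the source image)."""
    target = torch.zeros_like(image)
    x1, y1, x2, y2 = box
    target[:, y1:y2, x1:x2] = image[:, y1:y2, x1:x2]
    return target


def build_visual_prompt(template, template_box, search):
    """Assemble the 2x2 visual prompt; the search-target quadrant is
    fully masked (zeros) and left for the model to reconstruct."""
    template_target = box_to_target_image(template, template_box)
    masked_target = torch.zeros_like(search)  # quadrant to be generated
    top = torch.cat([template, template_target], dim=-1)
    bottom = torch.cat([search, masked_target], dim=-1)
    return torch.cat([top, bottom], dim=-2)


class MIMTrackerSketch(nn.Module):
    """Single-stream generator that fills in the masked quadrant from the
    context of the other three; a small conv stack stands in for the
    actual encoder-decoder."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, prompt):
        recon = self.net(prompt)
        h, w = prompt.shape[-2] // 2, prompt.shape[-1] // 2
        # Return only the generated search-target quadrant; the target
        # box would then be decoded from these pixels.
        return recon[..., h:, w:]


# Usage: all state prediction reduces to generating one image quadrant.
template = torch.rand(3, 128, 128)
search = torch.rand(3, 128, 128)
prompt = build_visual_prompt(template, (32, 32, 96, 96), search)
target_pred = MIMTrackerSketch()(prompt.unsqueeze(0))  # (1, 3, 128, 128)
```

The point of the sketch is the alignment claim in the abstract: because the bounding box lives in the same RGB space as the template and search images, localization and scale fall out of a single pixel-generation step rather than separate regression and classification branches.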