Abstract: For a long time, the most common paradigm in Multi-Object Tracking was tracking-by-detection (TbD), where
objects are first detected and then associated over video
frames. For association, most models resorted to motion
and appearance cues, e.g., re-identification networks.
Recent approaches based on attention propose to learn
the cues in a data-driven manner, showing impressive
results. In this paper, we ask ourselves whether simple
good old TbD methods are also capable of achieving
the performance of end-to-end models. To this end,
we propose two key ingredients that allow a standard
re-identification network to excel at appearance-based
tracking. We extensively analyse its failure cases, and
show that a combination of our appearance features with
a simple motion model leads to strong tracking results. Our
tracker generalizes to four public datasets, namely MOT17,
MOT20, BDD100k, and DanceTrack, achieving state-of-the-art performance. https://github.com/dvl-tum/GHOST.
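
As a rough illustration of the association step in tracking-by-detection, the sketch below combines an appearance distance with a motion distance and matches tracks to detections. It is not the implementation from the linked repository: the weighted sum, the greedy matching (used here in place of an optimal assignment solver), the dictionary keys, and the parameters `motion_weight` and `max_cost` are all assumptions made for this example. The `predicted_box` of each track is assumed to come from some simple motion model.

```python
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero appearance embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def iou_distance(box_a, box_b):
    """1 - IoU for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return 1.0 - (inter / union if union > 0 else 0.0)


def associate(tracks, detections, motion_weight=0.5, max_cost=0.7):
    """Greedily match tracks to detections by combined cost.

    Each track is a dict with hypothetical keys "embedding" and
    "predicted_box"; each detection has "embedding" and "box".
    Returns a list of (track_index, detection_index) pairs.
    """
    candidates = []
    for t, track in enumerate(tracks):
        for d, det in enumerate(detections):
            appearance = cosine_distance(track["embedding"], det["embedding"])
            motion = iou_distance(track["predicted_box"], det["box"])
            # Assumed combination: a simple weighted sum of both cues.
            cost = (1.0 - motion_weight) * appearance + motion_weight * motion
            if cost <= max_cost:  # discard implausible pairs
                candidates.append((cost, t, d))

    matches, used_tracks, used_dets = [], set(), set()
    for cost, t, d in sorted(candidates):  # cheapest pairs first
        if t not in used_tracks and d not in used_dets:
            matches.append((t, d))
            used_tracks.add(t)
            used_dets.add(d)
    return matches
```

Tracks left unmatched would typically be kept inactive for a few frames, and unmatched detections would start new tracks; both policies are outside the scope of this sketch.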