TL;DR: A novel multi-object tracking framework, MOTE, that uses deformable transformers, optical flow, and softmax splatting to track objects effectively even under prolonged occlusions.
Abstract: This paper introduces MOTE (MOre Than meets the Eye), a novel multi-object tracking (MOT) algorithm designed to address the challenges of tracking occluded objects. By integrating deformable detection transformers with a custom disocclusion matrix, MOTE significantly enhances the ability to track objects even when they are temporarily hidden from view. The algorithm leverages optical flow to generate features that are processed through a softmax splatting layer, which aids in the creation of a disocclusion matrix. This matrix plays a crucial role in maintaining track consistency by estimating the motion of occluded objects. MOTE's architecture includes modifications to the enhanced track embedding module (ETEM), which allows it to incorporate these advanced features into the track query layer embeddings. This integration ensures that the model not only tracks visible objects but also accurately predicts the trajectories of occluded ones, much like the human visual system. The proposed method is evaluated on multiple datasets, including MOT17, MOT20, and DanceTrack, where it achieves strong tracking metrics: 82.0 MOTA and 66.3 HOTA on the MOT17 dataset, 81.7 MOTA and 65.8 HOTA on the MOT20 dataset, and 93.2 MOTA and 74.2 HOTA on the DanceTrack dataset. Notably, MOTE excels in reducing identity switches and maintaining consistent tracking in complex real-world scenarios with frequent occlusions, outperforming existing state-of-the-art methods across all tested benchmarks.
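The abstract's pipeline step of forward-warping optical-flow features through a softmax splatting layer can be illustrated with a minimal sketch. This is not the authors' implementation: softmax splatting (Niklaus & Liu, 2020) forward-warps each source pixel along its flow vector and blends colliding contributions with softmax weights from a per-pixel importance score (e.g. inverse depth). The function name, nearest-neighbor target rounding, and array shapes below are illustrative assumptions; the original method uses differentiable bilinear splatting.

```python
import numpy as np

def softmax_splat(feat, flow, z):
    """Forward-warp `feat` (H, W, C) along `flow` (H, W, 2).

    Overlapping source pixels are blended with weights exp(z),
    where `z` (H, W) scores each pixel's importance, so that
    higher-scoring (e.g. nearer) pixels dominate at collisions.
    """
    H, W, C = feat.shape
    num = np.zeros((H, W, C))    # weighted-feature accumulator
    den = np.zeros((H, W, 1))    # weight accumulator
    w = np.exp(z - z.max())      # numerically stabilized weights
    for y in range(H):
        for x in range(W):
            # nearest-neighbor target (bilinear in the original paper)
            tx = int(round(x + flow[y, x, 0]))
            ty = int(round(y + flow[y, x, 1]))
            if 0 <= tx < W and 0 <= ty < H:
                num[ty, tx] += w[y, x] * feat[y, x]
                den[ty, tx, 0] += w[y, x]
    # softmax normalization; pixels that receive no contribution stay zero
    return num / np.maximum(den, 1e-8)
```

With zero flow the warp is the identity, and where two sources collide the one with the larger `z` dominates the blend, which is what lets a splatted feature map encode which object is in front during an occlusion.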
Lay Summary: Imagine watching security footage of a crowded mall where you need to track specific people, but they keep disappearing behind pillars or other shoppers. Current computer vision systems often lose track and confuse identities when people reappear, like mixing up two people after they cross paths. This is a critical problem for applications from autonomous vehicles to elderly care monitoring.
We developed MOTE, which mimics how humans naturally predict where hidden objects will reappear. Just as you can guess where someone will emerge after walking behind a tree based on their walking speed and direction, MOTE combines three techniques: analyzing motion patterns, creating "depth maps" to understand who's in front, and using AI memory to maintain identities. Think of it as giving computers the ability to mentally "fill in the gaps" when objects are temporarily hidden.
Our tests show MOTE reduces identity confusion by 25% compared to existing methods while running fast enough for real-time applications. This breakthrough could make self-driving cars safer by better tracking pedestrians who step behind parked vehicles, improve sports analytics by following players through pile-ups, and enhance security systems in crowded spaces where reliable tracking is crucial for public safety.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/ostadabbas/MOTE-More-Than-Meets-the-Eye-Tracking
Primary Area: Applications->Computer Vision
Keywords: Multi-object tracking, occlusion handling, deformable transformers, softmax splatting, optical flow estimation, enhanced track embedding, computer vision, motion estimation, deep learning.
Submission Number: 360