# SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking

This repository contains the code for ICLR'26 submission #8621 titled "*SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking*".

> **Abstract**: *Visual Object Tracking (VOT) is widely used in applications like autonomous driving to continuously track targets in videos. Existing methods can be roughly categorized into template matching and autoregressive methods, where the former usually neglects the temporal dependencies across frames and the latter tends to get biased towards the object categories during training, showing weak generalizability to unseen classes. To address these issues, some methods propose to adapt the video foundation model SAM2 for VOT, where the tracking results of each frame would be encoded as memory for conditioning the rest of frames in an autoregressive manner. Nevertheless, existing methods fail to overcome the challenges of object occlusions and distractions, and do not have any measures to intercept the propagation of tracking errors. To tackle them, we present a SAMITE model, built upon SAM2 with additional modules, including: (1) Prototypical Memory Bank: We propose to quantify the feature-wise and position-wise correctness of each frame's tracking results, and select the best frames to condition subsequent frames. As the features of occluded and distracting objects are feature-wise and position-wise inaccurate, their scores would naturally be lower and thus can be filtered to intercept error propagation; (2) Positional Prompt Generator: To further reduce the impacts of distractors, we propose to generate positional mask prompts to provide explicit positional clues for the target, leading to more accurate tracking. Extensive experiments have been conducted on six benchmarks, showing the superiority of SAMITE. The code will be released upon paper acceptance.*

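The frame-selection idea described in the abstract can be illustrated with a toy example. This is a minimal sketch and not the implementation in this repository: the `FrameRecord` type, the `select_memory_frames` helper, the product used to combine the two scores, and the `k` and `min_score` values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FrameRecord:
    index: int
    feature_score: float   # hypothetical feature-wise correctness in [0, 1]
    position_score: float  # hypothetical position-wise correctness in [0, 1]


def select_memory_frames(records, k=3, min_score=0.5):
    """Return indices of up to k frames whose combined score clears min_score."""
    # Assumption: combine the two scores by multiplication, so a frame must be
    # reliable both feature-wise and position-wise to rank highly.
    scored = [(r.feature_score * r.position_score, r.index) for r in records]
    kept = [(score, idx) for score, idx in scored if score >= min_score]
    kept.sort(reverse=True)  # best-scoring frames first
    return sorted(idx for _, idx in kept[:k])  # restore temporal order


history = [
    FrameRecord(0, 0.95, 0.90),
    FrameRecord(1, 0.40, 0.85),  # e.g. target partially occluded
    FrameRecord(2, 0.90, 0.30),  # e.g. mask drifted onto a distractor
    FrameRecord(3, 0.88, 0.92),
]
selected = select_memory_frames(history, k=2)
print(selected)
```

In this toy history, the low-scoring frames 1 and 2 are filtered out, so only frames 0 and 3 would condition later frames.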
## Setup

- Basic requirements:
  ```
  > conda env create -f env.yaml
  ```
- SAM2:
  
  SAM2 needs to be installed before use. Please see [INSTALL.md](https://github.com/facebookresearch/sam2/blob/main/INSTALL.md) from the original SAM 2 repository for FAQs on potential issues and solutions.
  ```
  > cd sam2
  > pip install -e .
  ```
  Then, download the checkpoints of SAM2:
  ```
  > cd checkpoints
  > sh ./download_ckpts.sh
  ```

## Datasets

Currently, we only provide instructions for the LaSOT and LaSOT-ext datasets; inference on other datasets will be available after paper acceptance.

- You can download LaSOT and LaSOT-ext with the following commands:
  ```
  > python download_dataset.py
  > sh unzip.sh data/LaSOT/
  > sh unzip.sh data/LaSOT-ext/
  ```

- Then, organize the data in the following format:
  ```
  data/LaSOT
  ├── airplane/
  │   ├── airplane-1/
  │   │   ├── full_occlusion.txt
  │   │   ├── groundtruth.txt
  │   │   ├── img
  │   │   ├── nlp.txt
  │   │   └── out_of_view.txt
  │   ├── airplane-2/
  │   ├── airplane-3/
  │   ├── ...
  ├── basketball
  ├── bear
  ├── bicycle
  ...
  ├── training_set.txt
  └── testing_set.txt
  ```
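Before running inference, it may help to confirm that the extracted data matches the layout above. The following is a minimal sketch, not a script shipped with this repository: the helper name `find_incomplete_sequences` is hypothetical, and the list of required files is taken from the `airplane-1` example in the tree.

```python
from pathlib import Path

# Per-sequence files shown in the directory tree above.
REQUIRED = ("groundtruth.txt", "full_occlusion.txt", "out_of_view.txt", "nlp.txt")


def find_incomplete_sequences(root):
    """Map 'category/sequence' to the list of entries missing from that sequence."""
    root = Path(root)
    problems = {}
    # Top-level files such as training_set.txt are skipped by the is_dir() filter.
    for category in sorted(p for p in root.iterdir() if p.is_dir()):
        for seq in sorted(p for p in category.iterdir() if p.is_dir()):
            missing = [name for name in REQUIRED if not (seq / name).is_file()]
            if not (seq / "img").is_dir():
                missing.append("img/")
            if missing:
                problems[f"{category.name}/{seq.name}"] = missing
    return problems


root = Path("data/LaSOT")
if root.is_dir():  # only report when the dataset has been downloaded
    for name, missing in find_incomplete_sequences(root).items():
        print(f"{name}: missing {', '.join(missing)}")
```

An empty report suggests that every sequence folder contains the annotation files and the `img` directory shown in the tree.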

## Inference

Run the following commands to run SAMITE inference on LaSOT and LaSOT-ext (SAM2.1-B is used by default):
- LaSOT:
```
> sh main_lasot.sh
```
- LaSOT-ext:
```
> sh main_lasot_ext.sh
```

## Evaluation

You can download our [raw results](https://drive.google.com/file/d/1YS7ENA7d7e4ScC9XUKiU07WGoutCdg_Y/view?usp=sharing) (**anonymous** Google Drive), extract the files to the root directory of this repository, and run the following commands to evaluate on LaSOT and LaSOT-ext:
- LaSOT:
```
> python scripts/analysis_results_lasot.py
```
- LaSOT-ext:
```
> python scripts/analysis_results_lasot_ext.py
```

## Acknowledgment

This repository is built mainly on [SAMURAI](https://github.com/yangchris11/samurai) and [SAM 2](https://github.com/facebookresearch/sam2?tab=readme-ov-file). We thank the authors for their great work!
