Generative Point Tracking with Flow Matching supplemental README
================================================================

The supplemental material includes a curated selection of videos showcasing GenPT's ability to capture multi-modality in point trajectories (the selection is curated because of the 100 MB supplemental size limit). Videos come in pairs: the first shows the integration process that generates the samples (overlaid on the first frame of the video in which the point is tracked), and the second shows the sampled trajectories in that video. The appendix can be found in the main paper submission, after the references section.

For all videos, GenPT is tracking a single query point and generating 100 samples.


"benchmarks" folder
-------------------

Contains videos of generated samples on a curated selection of videos from the Dynamic Replica, PointOdyssey, TAP-Vid, and TAP-Vid Sliding Occluder benchmarks. TAP-Vid and TAP-Vid Sliding Occluder videos are taken only from DAVIS and RGB-S. The model used is the main GenPT model trained on PointOdyssey.


"discriminative vs generative" folder
-------------------------------------

Contains videos of point trajectories from two GenPT models, both trained on PointOdyssey in a shortened training setup: a standard GenPT model and a variant GenPT model. The variant is trained with no generative prior and no conditional probability path (in a manner similar to CoTracker3), and it does not use ODE integration during inference (it simply takes K ground truth estimation steps). During testing, however, we introduce stochasticity by initializing point trajectories, visibilities, and confidences in the same manner as the standard GenPT. This shows that, without generative training, simply adding stochasticity to a discriminative model at test time does not result in meaningful multi-modality.

Videos are taken from the DAVIS test set from the TAP-Vid and TAP-Vid Sliding Occluder benchmarks.


"large L, including ground truth estimates" folder
--------------------------------------------------

Contains trios of videos of point trajectories generated by the main GenPT model (trained on PointOdyssey) on videos from the DAVIS test set of the TAP-Vid benchmark. Here, the number of integration steps (L - 1) is set to 19 so that it is easy to see the linear travel path of the samples as they are integrated. This linear path is a feature unique to the optimal transport formulation we use for flow matching, and it serves as a sanity check that the conditional probability path during testing closely follows an optimal transport path, as devised during training.

For each trio of videos, the first video shows the ground truth estimates made at each integration step, the second video shows the integration process of generated samples, and the third video shows the sampled trajectories in the video where the point is being tracked.

Notice how, in the first video of each trio, the ground truth estimates for a given sample grow increasingly accurate from k = 1 to k = K = 3. After the Kth update, the model uses the estimate to integrate to the next sample, from which the ground truth estimation loop starts again. This shows that iterative refinement helps the model arrive at a more accurate approximation of the ground truth, which in turn gives a more accurate approximation of the vector field and therefore a more accurate sample at the next integration step.
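
The loop described above can be illustrated with a minimal toy sketch. It is not GenPT's code. It assumes the standard optimal transport conditional path x_t = (1 - t) x_0 + t x_1, under which a ground truth estimate x1_hat implies the velocity (x1_hat - x_t) / (1 - t), and it uses plain Euler steps. The names refine_estimate and sample_trajectory are hypothetical, and the "move halfway to the target" update is only a stand-in for one refinement step of a learned model.

```python
import math
import random


def refine_estimate(estimate, target):
    """Toy stand-in for one refinement update (hypothetical).

    A real model would predict this from video features; here the estimate
    simply moves halfway toward the true target on every call.
    """
    return [e + 0.5 * (g - e) for e, g in zip(estimate, target)]


def sample_trajectory(target, num_time_points=20, num_refinements=3, seed=0):
    """Integrate one sample from noise (t = 0) toward the data (t = 1).

    num_time_points plays the role of L, so there are L - 1 Euler steps;
    num_refinements plays the role of K.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # initial noise sample
    path = [list(x)]
    dt = 1.0 / (num_time_points - 1)
    for step in range(num_time_points - 1):
        t = step * dt
        # Inner loop: K ground truth estimation steps for the current sample.
        estimate = list(x)
        for _ in range(num_refinements):
            estimate = refine_estimate(estimate, target)
        # Assumed OT-path velocity implied by the final estimate.
        velocity = [(e - xi) / (1.0 - t) for e, xi in zip(estimate, x)]
        # Euler step to the next sample.
        x = [xi + dt * v for xi, v in zip(x, velocity)]
        path.append(list(x))
    return path


if __name__ == "__main__":
    target = [2.0, -1.0]
    for k in (1, 3):
        path = sample_trajectory(target, num_refinements=k)
        print(f"K = {k}: final distance to target = {math.dist(path[-1], target)}")
```

In this toy setting every velocity points from the current sample toward the target, so all L points lie on one straight line, and running with K = 3 ends closer to the target than running with K = 1 from the same initial noise.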