Minimizing Human Labeling in Training Deep Models for Pedestrian Intention Prediction

Muhammad Naveed Riaz, Maciej Wielgosz, Antonio M. López

Published: 01 Jan 2025, Last Modified: 28 Feb 2026, Venue: IEEE Transactions on Intelligent Transportation Systems, License: CC BY-SA 4.0

Abstract: Accurately predicting whether pedestrians will cross in front of an autonomous vehicle is essential for ensuring safe and comfortable maneuvers. However, developing models for this task remains challenging due to the limited availability of diverse datasets containing both crossing (C) and non-crossing (NC) scenarios. Therefore, we propose a procedure that leverages synthetic videos with C/NC labels and an untrained model whose architecture is designed for C/NC prediction to automatically produce C/NC labels for a set of real-world videos. Because this procedure performs synth-to-real unsupervised domain adaptation for C/NC prediction, we term it S2R-UDA-CP. To assess the effectiveness of S2R-UDA-CP in self-labeling, we use two state-of-the-art models, PedGNN and ST-CrossingPose, and rely on the publicly available PedSynth dataset, which consists of synthetic videos with C/NC labels. Notably, once the real-world videos are self-labeled, they can be used to train models different from those used in S2R-UDA-CP. Such downstream models are designed to operate onboard a vehicle, whereas S2R-UDA-CP is an offline procedure. To evaluate the quality of the C/NC labels generated by S2R-UDA-CP, we also employ PedGraph+, another reference model from the literature, because it is not used in S2R-UDA-CP. Overall, the results show that models trained to predict C/NC on videos labeled by S2R-UDA-CP perform even better than models trained on human-labeled data. Our study also highlights several discrepancies between automatic and human labeling. To the best of our knowledge, this is the first study to evaluate synth-to-real self-labeling for C/NC prediction.

External IDs: doi:10.1109/tits.2025.3565667