Multi-perspective Traffic Video Description Model with Fine-grained Refinement Approach

Tuan-An To, Minh-Nam Tran, Trong-Bao Ho, Thien-Loc Ha, Quang-Tan Nguyen, Hoang-Chau Luong, Thanh-Duy Cao, Minh-Triet Tran

Published: 2024, Last Modified: 26 May 2026. Venue: CVPR Workshops 2024. License: CC BY-SA 4.0
Abstract: Analyzing traffic patterns is crucial for enhancing safety and optimizing flow within cities. While cities possess extensive camera networks for monitoring, the raw video data often lacks the contextual detail necessary for understanding complex traffic incidents and the behaviors of road users. In this paper, we propose a novel methodology for generating comprehensive descriptions of traffic scenarios, combining a vision-language model with rule-based refinements to capture pertinent pedestrian, vehicle, and environmental factors. First, a captioning model generates a general description from the processed video. This description is then refined sequentially by three primary modules (pedestrian-aware, vehicle-aware, and context-aware) to produce the final description. We evaluate our method on the Woven Traffic Safety dataset in Track 2 of the AI City Challenge 2024, obtaining competitive results with an S2 score of 22.6721. Code will be available at https://github.com/ToTuanAn/AICityChallenge2024_Track2
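The sequential refinement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the attribute keys (`pedestrian_attention`, `vehicle_speed`, `weather`), the sentence templates, and the helper names `make_refiner` and `refine_description` are all hypothetical, and the general description is assumed to come from a captioning model that is not shown here. The sketch only shows how three rule-based modules could be chained so that each one refines the output of the previous one.

```python
from typing import Callable, Dict, List

# A refiner takes the current description plus detected attributes
# and returns an updated description.
Refiner = Callable[[str, Dict[str, str]], str]


def make_refiner(key: str, template: str) -> Refiner:
    """Build a hypothetical rule: append one sentence when `key` was detected."""
    def refine(description: str, attributes: Dict[str, str]) -> str:
        value = attributes.get(key)
        if not value:
            return description  # nothing detected, leave the text unchanged
        return f"{description} {template.format(value=value)}"
    return refine


# Illustrative stand-ins for the three modules named in the abstract.
pedestrian_aware = make_refiner("pedestrian_attention", "The pedestrian is {value}.")
vehicle_aware = make_refiner("vehicle_speed", "The vehicle is moving at {value}.")
context_aware = make_refiner("weather", "The weather is {value}.")

PIPELINE: List[Refiner] = [pedestrian_aware, vehicle_aware, context_aware]


def refine_description(
    general: str, attributes: Dict[str, str], refiners: List[Refiner]
) -> str:
    """Apply each refiner in order to the captioning model's general description."""
    description = general
    for refiner in refiners:
        description = refiner(description, attributes)
    return description
```

Under these assumptions, calling `refine_description("A pedestrian crosses the road.", {"weather": "clear"}, PIPELINE)` would pass the caption through all three modules and append only the context sentence, since no pedestrian or vehicle attribute is supplied.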