Abstract: Driver attention prediction has attracted growing interest recently owing to its role in developing advanced driver assistance systems (ADAS) and intelligent vehicles. The emergence of video foundation models (VFMs) has opened new possibilities for video understanding tasks such as video saliency prediction (VSP). However, these large models are often not cost-effective for ADAS and intelligent vehicles because of their size and resource demands. To address this, we present an early effort to apply knowledge distillation to driver visual attention prediction, employing the first VFM-based VSP model, SalFoM, as the teacher network. Because driver attention prediction datasets are much smaller than those used to train large models, fine-tuning such models directly is challenging given their high parameter count. We therefore design a VFM-based driver attention prediction network with far fewer parameters than the teacher network. Experimental results demonstrate the effectiveness of our model on benchmark datasets.
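A common formulation of knowledge distillation for saliency prediction trains the student to match both the teacher's predicted saliency map and the ground-truth fixation map, typically via a KL-divergence loss on the normalized maps. The sketch below illustrates this generic setup in NumPy; the function names, the `alpha` weighting, and the use of KL divergence as the sole loss term are illustrative assumptions, not the specific loss of the paper's method.

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    # KL divergence between two saliency maps, each normalized
    # to a probability distribution over pixels.
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def distillation_loss(student_map, teacher_map, gt_map, alpha=0.5):
    # Weighted sum of a distillation term (student mimics the teacher)
    # and a supervised term (student matches the ground-truth map).
    # `alpha` is a hypothetical balancing hyperparameter.
    return (alpha * kl_div(teacher_map, student_map)
            + (1 - alpha) * kl_div(gt_map, student_map))
```

In this setup the large teacher (e.g. a VFM-based VSP model) runs only at training time, while the compact student is what ships in the vehicle, which is what makes distillation attractive for resource-constrained ADAS deployment.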