Abstract: This paper proposes a video transformer architecture for detecting risk events in frail adults from egocentric video monitoring data. We first introduce an extended taxonomy of risk events, and then propose a transformer-based video recognition model to detect them. The proposed architecture uses separable attention over the spatial and temporal dimensions. We also introduce a pooling operation over the temporal video data that learns the importance of the data being pooled. Experiments were conducted on the visual data of the BIRDS dataset, which was recorded in the wild, and on Kinetics-400 for benchmarking. Using the pooling operation in the transformer yields a 3% improvement on the BIRDS dataset.
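
The abstract does not specify how the learned temporal pooling is computed. As a rough illustration only, one common way to pool by learned importance is to score each frame, normalize the scores with a softmax, and take the weighted average of the frame features. The sketch below assumes that formulation; the names `importance_pool` and `w` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def importance_pool(frames: Sequence[Sequence[float]], w: Sequence[float]) -> List[float]:
    """Pool T frame feature vectors into a single vector.

    Assumption (not from the paper): each frame gets a scalar score from a
    dot product with a learned vector `w`; softmax turns the scores into
    weights that sum to 1, and the output is the weighted average of frames.
    """
    scores = [sum(f_i * w_i for f_i, w_i in zip(frame, w)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]


# Three frames with 2-dimensional features; the values are illustrative only.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With a zero scoring vector every frame is equally important (mean pooling).
pooled_uniform = importance_pool(frames, [0.0, 0.0])

# A scoring vector that strongly favors the first feature picks out frame 0.
pooled_peaked = importance_pool(frames, [10.0, -10.0])
```

With a zero scoring vector the operation reduces to plain average pooling, so a model trained this way can fall back to uniform pooling when no frame is more informative than the others.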