Abstract: This article introduces our methods and experimental results for the submission to the Action Recognition and Action Anticipation tasks (Track 1) of the Trauma THOMPSON Challenge. In the medical field, accurately identifying and predicting key actions in Life-Saving Intervention (LSI) procedures is crucial for patient survival and recovery. The essence of the Trauma THOMPSON Challenge lies in using computer vision to automatically recognize and anticipate key actions from a first-person perspective in the medical domain. To tackle this challenge, we employed extensive data processing techniques and compared several advanced model approaches, including SlowFast, TSN, Video-Swin, and I3D. The challenge uses Top-1 Action accuracy as its evaluation metric: since each action consists of a verb and a noun, a prediction is counted as correct only when both components are predicted correctly. In the end, we selected the predictions generated by Video-Swin as our final submission, achieving a Top-1 Action accuracy of 0.2277 for Action Recognition and 0.1873 for Action Anticipation on the leaderboard.
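For clarity, the short sketch below (not from the original submission) illustrates how such a Top-1 Action accuracy could be computed, assuming per-clip (verb_id, noun_id) top-1 predictions and ground-truth labels; the function name, data layout, and example values are our own illustration.

```python
# Minimal sketch of Top-1 Action accuracy: a prediction counts as correct
# only when both the predicted verb and the predicted noun match the labels.
from typing import List, Tuple


def top1_action_accuracy(
    predictions: List[Tuple[int, int]],   # (verb_id, noun_id) with highest score per clip
    ground_truth: List[Tuple[int, int]],  # annotated (verb_id, noun_id) per clip
) -> float:
    """Fraction of clips whose top-1 verb AND top-1 noun are both correct."""
    assert len(predictions) == len(ground_truth)
    correct = sum(
        int(pv == gv and pn == gn)
        for (pv, pn), (gv, gn) in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)


# Hypothetical example: 2 of 4 clips have both verb and noun correct -> 0.5
preds = [(3, 17), (5, 2), (3, 17), (1, 9)]
gts   = [(3, 17), (5, 4), (2, 17), (1, 9)]
print(top1_action_accuracy(preds, gts))  # 0.5
```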