## A. Development of Video Understanding Methods

The evolution of video understanding methods can be divided into four stages, as shown in Figure 1:

1) Conventional Methods: In the early stages of video understanding, handcrafted feature extraction techniques such as Scale-Invariant Feature Transform (SIFT) [1], Speeded-Up Robust Features (SURF) [2], and Histogram of Oriented Gradients (HOG) [3] were used to capture key information in videos. Background subtraction [4], optical flow methods [5], and Improved Dense Trajectories (IDT) [6], [7] were used to model motion information for tracking. Since videos can be viewed as time-series data, temporal analysis techniques such as Hidden Markov Models (HMM) [8] were also used to understand video content. Before deep learning became popular, basic machine learning algorithms such as Support Vector Machines (SVM) [9], Decision Trees [10], and Random Forests were applied to video classification and recognition tasks. Cluster analysis [11] for grouping video segments and Principal Component Analysis (PCA) [12], [13] for dimensionality reduction were also commonly used for video analysis.

<!-- image -->

2) Early Neural Video Models: Compared with classical methods, deep learning methods for video understanding possess superior task-solving capabilities. DeepVideo [14] and [15] were early methods that introduced a deep neural network, specifically a Convolutional Neural Network (CNN), for video understanding. However, their performance did not surpass that of the best handcrafted-feature methods because motion information was used inadequately. Two-stream networks [16] combined CNN and IDT to capture motion information and improve performance, which verified the capability of deep neural networks for video understanding. To handle long-form video understanding, Long Short-Term Memory (LSTM) was adopted [17]. Temporal Segment Network (TSN) [18] was also designed for long-form video understanding by analyzing and aggregating video segments. Besides TSN, Fisher Vectors (FV) encoding [19], Bi-Linear encoding [20], and Vector of Locally Aggregated Descriptors (VLAD) [21] encoding were introduced [22]. These methods improved performance on the UCF-101 [23] and HMDB51 [24] datasets. In contrast to two-stream networks, 3D networks opened another branch by introducing 3D CNNs to video understanding (C3D) [25]. Inflated 3D ConvNets (I3D) [26] reuses the initialization and architecture of a 2D CNN, Inception [27], to achieve a large improvement on the UCF-101 and HMDB51 datasets. Subsequently, researchers began using the Kinetics-400 (K-400) [28] and Something-Something [29] datasets to evaluate model performance in more challenging scenarios. ResNet [30], ResNeXt [31], and SENet [32] were also adapted from 2D to 3D, resulting in R3D [33], MFNet [34], and STC [35]. To improve efficiency, various studies decomposed the 3D convolution into cascaded 2D and 1D convolutions (e.g., S3D [36], ECO [37], P3D [38]). LTC [39], T3D [40], Non-local [41], and V4D [42] focus on long-form temporal modeling, while CSN [43], SlowFast [44], and X3D [45] aim for high efficiency. The introduction of Vision Transformers (ViT) [46] prompted a series of prominent models (e.g., TimeSformer [47], VidTr [48], ViViT [49], MViT [50]).
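The efficiency gain from decomposing a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution can be illustrated with a simple weight count. The sketch below is only an illustration under stated assumptions: the function names, the channel sizes, and the choice of setting the intermediate channel count equal to the output channel count are hypothetical, biases are ignored, and the cited models (S3D, ECO, P3D) each differ in their actual layer designs.

```python
def conv3d_params(c_in, c_out, t, k):
    """Weight count of a full t x k x k 3D convolution (bias ignored)."""
    return c_in * c_out * t * k * k


def factorized_params(c_in, c_out, t, k, c_mid=None):
    """Weight count of a 1 x k x k spatial convolution followed by a
    t x 1 x 1 temporal convolution (bias ignored).

    c_mid is the channel count between the two convolutions; defaulting
    it to c_out is an assumption made for this illustration.
    """
    if c_mid is None:
        c_mid = c_out
    spatial = c_in * c_mid * k * k   # 2D convolution over height and width
    temporal = c_mid * c_out * t     # 1D convolution over time
    return spatial + temporal


# Hypothetical layer: 64 input channels, 64 output channels, 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, t=3, k=3)            # 110592 weights
factorized = factorized_params(64, 64, t=3, k=3)  # 49152 weights
print(f"full 3D: {full}, factorized: {factorized}, "
      f"ratio: {factorized / full:.3f}")
```

For this configuration the factorized form needs 12/27 of the weights of the full 3D kernel, which gives a rough sense of why such decompositions are attractive when efficiency matters.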

3) Self-supervised Video Pretraining: The transferability [51], [52] of self-supervised pretrained models [53] for video understanding allows them to generalize across diverse tasks with minimal additional labeling, overcoming early deep learning models' need for extensive task-specific data. VideoBERT [54] is an early attempt at video pretraining. Based on the bidirectional language model BERT [55], it designs pretraining tasks for self-supervised learning from video-text data and tokenizes video features with hierarchical k-means. The pretrained model can be fine-tuned to handle multiple downstream tasks, including action classification and video captioning. Following the 'pretraining-finetuning' paradigm, many studies on pretrained models for video understanding, especially video-language models, have emerged. They use either different architectures (ActBERT [56], SpatiotemporalMAE [57], OmniMAE [58], VideoMAE [59], MotionMAE [60]) or different training strategies (MaskFeat [61], VLM [62], ALPRO [63], All-in-One transformer [64], MaskViT [65], CLIP-ViP [66], Singularity [67], LF-VILA [68], EMCL [69], HiTeA [70], CHAMPAGNE [71]).

4) Large Language Models for Video Understanding: Recently, large language models (LLMs) have advanced rapidly [72]. The emergence of LLMs pretrained on extensive datasets has introduced a novel in-context learning capability [73], which allows them to handle various tasks using prompts without fine-tuning. ChatGPT [74] is the first groundbreaking application built on this foundation. Its capabilities include generating code and invoking tools or APIs of other models. Many studies explore using LLMs such as ChatGPT to call vision model APIs to solve computer vision problems, including Visual-ChatGPT [75]. The advent of instruction tuning has further enhanced these models' ability to respond effectively to user requests and perform specific tasks [76].
LLMs integrated with video understanding capabilities offer more sophisticated multimodal understanding, enabling them to process and interpret complex interactions between visual and textual data. Similar to their impact in Natural Language Processing (NLP) [77], these models act as more general-purpose task solvers, adept at handling a broader range of tasks by leveraging the extensive knowledge and contextual understanding acquired from vast amounts of multimodal data. This allows them not only to understand visual content but also to reason about it in a more human-like way. Many works therefore explore using LLMs for video understanding tasks; such models are referred to as Vid-LLMs.