HMNet: a hierarchical multi-modal network for educational video concept prediction

Published: 2023, Last Modified: 19 Jan 2026 · Int. J. Mach. Learn. Cybern. 2023 · CC BY-SA 4.0
Abstract: Educational video concept prediction is a challenging task in online education systems that aims to assign appropriate hierarchical concepts to a video. The key to this problem is modeling and fusing the multimodal information of the video. However, most prior studies ignore the incremental characteristics of educational videos, and most video segmentation strategies do not transfer well to them. Moreover, most existing methods overlook the class hierarchy and do not consider class dependencies when predicting the hierarchical concepts of a video. To that end, in this paper we propose a Hierarchical Multi-modal Network (HMNet) framework that predicts the hierarchical concepts of educational videos by fusing multimodal information and modeling class dependencies. Specifically, we first apply a video divider that extracts keyframes from the video while accounting for the incremental characteristics of educational videos; the video is thereby divided into a series of video sections with subtitles. Then, we utilize a multi-modal encoder to obtain a unified representation across modalities. Finally, we design a hierarchical predictor that fuses the multi-modality representation, models the class dependencies, and predicts the hierarchical concepts of the video in a top-down manner. Extensive experimental results on two real-world datasets demonstrate the effectiveness and explanatory power of HMNet.
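The abstract's "top-down" prediction with class dependencies can be illustrated with a minimal sketch: choose a parent concept first, then restrict the child prediction to that parent's children. Everything below is hypothetical, not the paper's actual model — the hierarchy, the fused feature vector, and the linear scoring heads are illustrative stand-ins.

```python
import math

# Hypothetical two-level concept hierarchy (illustrative only,
# not from the paper's datasets).
HIERARCHY = {
    "math": ["algebra", "calculus"],
    "cs": ["sorting", "graphs"],
}

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def score(weight_vec, feat):
    # Simple linear scoring head: dot product of class weights and features.
    return sum(w * f for w, f in zip(weight_vec, feat))

def predict_top_down(feat, parent_w, child_w):
    """Predict a (parent, child) concept pair top-down: pick a parent
    concept first, then compute the child distribution only over that
    parent's children, so the class hierarchy constrains the output."""
    parents = list(HIERARCHY)
    p_probs = softmax([score(parent_w[p], feat) for p in parents])
    parent = parents[p_probs.index(max(p_probs))]
    children = HIERARCHY[parent]
    c_probs = softmax([score(child_w[c], feat) for c in children])
    child = children[c_probs.index(max(c_probs))]
    return parent, child

# Toy fused multimodal representation and per-class weights (hypothetical).
feat = [1.0, 0.2]
parent_w = {"math": [1.0, 0.0], "cs": [0.0, 1.0]}
child_w = {"algebra": [1.0, 0.0], "calculus": [0.0, 1.0],
           "sorting": [1.0, 0.0], "graphs": [0.0, 1.0]}
print(predict_top_down(feat, parent_w, child_w))  # → ('math', 'algebra')
```

The design choice this sketches is the dependency structure: an impossible pair such as ("math", "sorting") can never be emitted, because the child classifier only ever sees the chosen parent's children.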