Abstract: Advances in the digitalisation of data have left media companies with large archives of content. These archives hold multimodal data together with the metadata associated with each media programme. Relating content across different media and metadata has thus become an emerging challenge, with applications in popular domains such as programme recommendation. In this paper, we combine content similarity measures computed from the distances between different forms of textual data obtained from subtitle files and metadata obtained from the genres of programmes. The textual representations we considered were neural semantic vectors, topic vectors, and a weighted Jaccard distance encoding lexical token rareness. The late fusion of these four distances gave the best recommendation results. On a weekly dataset of 145 TV programmes, it increased the precision of the genre-based recommendations by 5.76%. On a monthly dataset of 906 programmes, it achieved an increase of 1.5%. This combination was more efficient than one that used audio and video files.
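The two techniques named in the abstract, a rareness-weighted Jaccard distance and late fusion of several distance matrices, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: token rareness is modelled with a smoothed inverse document frequency, each distance matrix is min-max normalised before fusion, and the fused distance is a weighted average with equal weights by default. The function names (`idf_weights`, `weighted_jaccard_distance`, `min_max`, `late_fusion`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Assumed rareness weight: smoothed inverse document frequency per token."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n / count) + 1.0 for tok, count in df.items()}


def weighted_jaccard_distance(a, b, weights):
    """1 - (weight of shared tokens / weight of all tokens in either document)."""
    a, b = set(a), set(b)
    union = sum(weights[t] for t in a | b)
    if union == 0:  # both documents are empty
        return 0.0
    inter = sum(weights[t] for t in a & b)
    return 1.0 - inter / union


def min_max(matrix):
    """Rescale a distance matrix to [0, 1] so different distances are comparable."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


def late_fusion(matrices, fusion_weights=None):
    """Assumed late fusion: weighted average of normalised distance matrices."""
    if fusion_weights is None:
        fusion_weights = [1.0 / len(matrices)] * len(matrices)
    normed = [min_max(m) for m in matrices]
    n = len(normed[0])
    return [
        [sum(w * m[i][j] for w, m in zip(fusion_weights, normed)) for j in range(n)]
        for i in range(n)
    ]


# Toy subtitle token lists standing in for three programmes (made-up data).
docs = [["news", "weather"], ["news", "sport"], ["drama"]]
weights = idf_weights(docs)
lexical = [[weighted_jaccard_distance(x, y, weights) for y in docs] for x in docs]

# A made-up genre distance matrix: 0 for the same genre, 1 otherwise.
genre = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]

fused = late_fusion([lexical, genre])
```

In this sketch, recommendations for programme `i` would be the programmes with the smallest values in row `fused[i]`. In the setting the abstract describes, the list passed to `late_fusion` would hold four matrices rather than two.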