- Keywords: Transfer Learning, Video Understanding, Fine-grained Video Classification, Video Captioning, Common Sense, Something-Something Dataset
- TL;DR: We investigate how the complexity and granularity of the source task affect the quality of the extracted features for transfer learning, keeping the model architecture fixed across all tasks.
- Abstract: In this paper, we investigate the correlation between the degree of detail (granularity) in the source task and the quality of the learned features for transfer learning to new tasks. For this purpose, we design a deep neural network for action classification and video captioning, and train the same video encoding architecture to solve multiple tasks at different granularity levels. In our transfer learning experiments, we fine-tune a network on a target task while freezing the video encoding learned from the source task. The experiments reveal that training on more fine-grained tasks tends to produce better features for transfer learning. We use the Something-Something dataset, which contains over 220,000 videos and target labels at multiple levels of granularity. With strong coarse-grained and fine-grained classification results, our model establishes a strong baseline on the new Something-Something captioning task.
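
The transfer-learning protocol described in the abstract — keep the source-task video encoding frozen and fine-tune only the target-task head — can be sketched as follows. This is a deliberately minimal, framework-free toy (a scalar "encoder" and a scalar head trained by gradient descent); the function names and the toy data are illustrative assumptions, not the paper's actual model.

```python
def encode(x, w_enc):
    """Stand-in for the fixed video encoder learned on the source task."""
    return w_enc * x


def fine_tune_head(data, w_enc, w_head=0.0, lr=0.01, steps=200):
    """Gradient descent on the task head only; w_enc stays frozen throughout."""
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            z = encode(x, w_enc)        # frozen feature extraction
            pred = w_head * z           # trainable target-task head
            grad += 2.0 * (pred - y) * z  # d(squared error)/d(w_head)
        w_head -= lr * grad / len(data)
    return w_head


# Toy target task: y = 3 * x. With the encoder frozen at w_enc = 1.5,
# the head must absorb the remaining factor and converge toward 2.0.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 2.0]]
w = fine_tune_head(data, w_enc=1.5)
print(round(w, 2))  # → 2.0
```

The point of the sketch is the asymmetry: all learning signal flows into `w_head`, while the encoder parameters are never updated, mirroring how the paper isolates the quality of the source-task features.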