Keywords: Transfer Learning, Video Understanding, Fine-grained Video Classification, Video Captioning, Common Sense, Something-Something Dataset
TL;DR: Investigating the link between the complexity and granularity of the source task and the quality of the extracted features for transfer learning, with the model architecture fixed across all tasks.
Abstract: In this paper, we investigate the correlation between the degree of detail
(granularity) in the source task and the quality of the learned features
for transfer learning to new tasks. For this purpose, we design a DNN for
action classification and video captioning. The same video encoding
architecture is trained to solve multiple tasks with different
granularity levels. In our transfer learning experiments, we fine-tune a
network on a target task, while freezing the video encoding learned from
the source task. Experiments reveal that training with more fine-grained
tasks tends to produce better features for transfer learning. We use the
Something-Something dataset, which contains over 220,000 videos and target
labels at multiple levels of granularity. With strong coarse-grained and
fine-grained classification results, our model establishes a strong baseline
on the new Something-Something captioning task.
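The transfer protocol described above (freeze the video encoding learned on the source task, fine-tune only on the target task) can be illustrated with a toy sketch. This is not the paper's implementation: the "encoder" is a hypothetical fixed random projection standing in for the trained video-encoding network, and only a new linear classification head is fit on synthetic target-task data.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(videos, W_enc):
    """Frozen encoder: project raw inputs into ReLU feature space."""
    return np.maximum(videos @ W_enc, 0.0)

# Pretend source-task training already produced these encoder weights.
W_enc = rng.normal(size=(32, 16))

# Synthetic target-task data: 64 flattened "videos", 4 classes.
X = rng.normal(size=(64, 32))
y = rng.integers(0, 4, size=64)
Y = np.eye(4)[y]  # one-hot labels

feats = encode(X, W_enc)    # encoder is frozen, so features are computed once
W_head = np.zeros((16, 4))  # new classification head, trained from scratch

def loss_and_probs(W):
    """Softmax cross-entropy loss and class probabilities."""
    logits = feats @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).mean(), p

initial_loss, _ = loss_and_probs(W_head)
W_enc_before = W_enc.copy()

# Fine-tune only the head with plain gradient descent.
for _ in range(200):
    _, p = loss_and_probs(W_head)
    grad = feats.T @ (p - Y) / len(y)
    W_head -= 0.1 * grad

final_loss, _ = loss_and_probs(W_head)
assert np.array_equal(W_enc, W_enc_before)  # encoder weights untouched
assert final_loss < initial_loss            # head fit the target task
```

The same pattern carries over to a real deep network by disabling gradients on the encoder parameters and optimizing only the target-task head.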