Abstract: We propose to self-supervise a convolutional neural network operating on images using temporal information from videos. The task is to learn a representation of single images, and the supervision is obtained by learning to group image pixels in such a way that their collective motion is “coherent”. This learning-by-grouping approach serves both as a pre-training strategy and as a segmentation strategy. Preliminary results suggest that the segments obtained are reasonable and that the learned representation transfers well to classification.

Acknowledgements: The authors gratefully acknowledge the support of ERC 677195-IDIU and AIMS CDT (EPSRC EP/L015897/1).