Towards accurate unsupervised video captioning with implicit visual feature injection and explicit

Published: 2024, Last Modified: 13 Nov 2024 · Pattern Recognit. Lett. 2024 · CC BY-SA 4.0
Abstract: Highlights
• Contrastive learning is used to minimise the disparities between pseudo-text labels and video features.
• Visual clues are aligned with the text generator for consistent semantic enhancement.
• Leveraging the Found within the sentences for semantic preservation.
• The approach outperforms existing unsupervised video captioning approaches.