Abstract: Highlights
• Contrastive learning is used to minimise the disparities between pseudo-text labels and video features.
• Visual clues are aligned with the text generator for consistent semantic enhancement.
• Leveraging the Found within the sentences for semantic preservation.
• Outperforms existing unsupervised video captioning approaches.