Abstract: Highlights
• Incorporating text-video similarity into rewards improves sentence distinctiveness.
• The distinctiveness can be improved without sacrificing accuracy.
• Performance improvement can be achieved without increasing inference time.