# Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Temporal Grounding


## Abstract
Video Temporal Grounding (TG) aims to locate the segment of a long video that matches a natural language description (a query). While Vision-Language Models (VLMs) are effective at holistic semantic matching, they often struggle with fine-grained temporal localisation. Recently, Group Relative Policy Optimisation (GRPO) reformulates the inference process as a reinforcement learning task, enabling fine-grained grounding and achieving strong in-domain performance. However, GRPO relies on labelled data, making it unsuitable for unlabelled domains. Moreover, because videos are large and expensive to store and process, full-scale adaptation introduces prohibitive latency and computational overhead, making it impractical for real-time deployment. To overcome both problems, we introduce a Data-Efficient Unlabelled Cross-domain Temporal Grounding method, in which a model is first trained on a labelled source domain and then adapted to a target domain using only a small number of *unlabelled videos from the target domain*. This approach eliminates the need for target annotation and keeps both computational and storage overhead low enough to run in real time. Specifically, we introduce **U**ncertainty-quantified **R**ollout **P**olicy **A**daptation (**URPA**) for cross-domain knowledge transfer in video temporal grounding without target labels. URPA generates multiple candidate predictions using GRPO rollouts, averages them to form a pseudo label, and estimates confidence from the variance across these rollouts. This confidence then weights the training rewards, guiding the model to focus on reliable supervision. Experiments on three datasets across six cross-domain settings show that URPA generalises well using only a few unlabelled target videos. Code is provided in the supplementary materials.
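
The pseudo-labelling step described in the abstract can be illustrated with a minimal sketch. This is not the repository's implementation. The function names, the temporal-IoU reward, the duration normalisation, the `exp(-variance / temperature)` confidence mapping, and the `temperature` value are all assumptions made for illustration. The abstract only states that rollouts are averaged into a pseudo label and that a variance-based confidence weights the rewards.

```python
import math
from statistics import fmean, pvariance


def temporal_iou(pred, target):
    """Intersection over union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


def pseudo_label_and_confidence(rollouts, duration, temperature=0.01):
    """Average the rollouts into a pseudo label and score its reliability.

    rollouts: list of (start, end) predictions for one video/query pair.
    duration: video length in seconds, used to make the variance scale-free.
    temperature: hypothetical knob controlling how fast confidence decays.
    """
    starts = [s for s, _ in rollouts]
    ends = [e for _, e in rollouts]
    pseudo = (fmean(starts), fmean(ends))
    # Variance of the normalised boundaries across rollouts.
    variance = pvariance([s / duration for s in starts]) + pvariance(
        [e / duration for e in ends]
    )
    # Assumed mapping: low variance -> confidence near 1, high -> near 0.
    confidence = math.exp(-variance / temperature)
    return pseudo, confidence


def weighted_rewards(rollouts, duration, temperature=0.01):
    """Reward each rollout by its IoU with the pseudo label, scaled by confidence."""
    pseudo, confidence = pseudo_label_and_confidence(rollouts, duration, temperature)
    return [confidence * temporal_iou(r, pseudo) for r in rollouts]


if __name__ == "__main__":
    tight = [(2.0, 5.0), (2.1, 5.1), (1.9, 4.9)]   # rollouts agree
    spread = [(0.0, 3.0), (4.0, 9.0), (2.0, 5.0)]  # rollouts disagree
    print(weighted_rewards(tight, duration=10.0))
    print(weighted_rewards(spread, duration=10.0))
```

In this sketch, rollouts that agree with each other yield a confidence close to 1, so their rewards are left almost untouched, while rollouts that disagree are down-weighted as a group.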

## Experimental Setting
* Training framework: We use the Easy-R1 framework and extend it with video training.
* Model: We select [Qwen2.5-VL-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) as the base model.
* Datasets: Charades and ActivityNet-tvg.

## Installation Guide
```
git clone https://github.com/appletea233/Temporal-R1.git
cd Temporal-R1
pip install -e .

# eval with lmms-eval
cd third_party/lmms-eval
pip install -e .
```

