1) Formulation of VC task:
As mentioned in Section 3.1, the VC task takes a free-form user question that contains two timestamps and outputs the steps that occur between them. Example questions are "What happened between t1 and t2" and "What steps were seen during the time from t1 to t2". The output is not architecturally constrained to a particular format. However, for ease of metric calculation, the training data annotations all use the same format, namely "s1 from t1 to t2, s2 from t3 to t4" and so on. This encourages the model to follow that format in its output, which we then parse to collect the step names s1, s2 and the corresponding time ranges (t1, t2), (t3, t4), respectively.
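As an illustration of the parsing step, here is a minimal sketch. The regular expression and the function name `parse_answer` are our assumptions for this example, not the exact code used in the paper, and the sketch assumes step names do not themselves contain commas.

```python
import re

# Hypothetical pattern for one "<step> from <start> to <end>" segment.
SEGMENT = re.compile(
    r"\s*(?P<step>.+?)\s+from\s+(?P<start>\d+(?:\.\d+)?)"
    r"\s+to\s+(?P<end>\d+(?:\.\d+)?)\s*$"
)

def parse_answer(text):
    """Return a list of (step_name, start, end); malformed segments are skipped."""
    segments = []
    for chunk in text.split(","):
        match = SEGMENT.match(chunk)
        if match is None:
            continue  # segment does not follow the expected format
        segments.append(
            (match["step"], float(match["start"]), float(match["end"]))
        )
    return segments

print(parse_answer("Hydrodissection from 213 to 272, Phacoemulsification from 272 to 512"))
```

Skipping malformed segments rather than raising an error is a design choice of this sketch; a stricter parser could instead count them as formatting failures.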

To evaluate the quality of VC, general video captioning papers such as TimeChat and VTimeLLM use two types of metrics. The first measures captioning quality by comparing n-grams with the ground truth (GT), using the METEOR or CIDEr score. The second measures whether the predicted time ranges match the ground-truth ranges, using mAP and mAR. In our case, the model outputs the step name directly, so n-gram matching is not required. If a predicted step does not exist in the GT, it is a false positive (FP). Conversely, if a step in the GT does not exist in the prediction, it is a false negative (FN). In both cases, the IoU is 0. For steps that exist in both the GT and the prediction, we compute the maximum IoU between their corresponding time ranges using the Hungarian algorithm. These IoUs, including the zeros for FP and FN steps, are averaged within each example and then across all examples to give the mIoU. If the maximum IoU is greater than a threshold, we consider the step a true positive (TP); otherwise, it is a false positive (FP). Recall is then calculated as TP / (TP + FN) and Precision as TP / (TP + FP). These are averaged across all examples and across three thresholds (0.3, 0.5, 0.7) to get the mAR and mAP, respectively.
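The matching step can be sketched as follows for the case where one step name has several predicted and ground-truth time ranges. This is only an illustration: for self-containment it replaces the Hungarian algorithm with a brute-force search over one-to-one assignments, which gives the same optimum on the small inputs considered here. The helper names `interval_iou` and `best_matching` are hypothetical.

```python
from itertools import permutations

def interval_iou(a, b):
    """IoU of two (start, end) time ranges."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def best_matching(preds, gts):
    """One-to-one matching of predicted and GT ranges with maximal total IoU.

    Brute-force stand-in for the Hungarian algorithm (assumption of this
    sketch). Returns a list of (pred_index, gt_index, iou) triples.
    """
    # Enumerate assignments from the shorter list into the longer one.
    swap = len(preds) > len(gts)
    rows, cols = (gts, preds) if swap else (preds, gts)
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(cols)), len(rows)):
        pairs = [(r, c, interval_iou(rows[r], cols[c])) for r, c in enumerate(perm)]
        total = sum(iou for _, _, iou in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    if swap:
        # Restore the (pred_index, gt_index, iou) ordering.
        best_pairs = [(c, r, iou) for r, c, iou in best_pairs]
    return best_pairs

# Two predicted and two GT ranges for the same step, listed in different order.
print(best_matching([(0, 10), (10, 20)], [(9, 20), (0, 9)]))
```

Ranges left unmatched by the assignment would then count as FP (extra predictions) or FN (extra ground truths) with an IoU of 0.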

Consider this example:
Question: "What processes unfolded from 213 to 512?"
Prediction: "Hydrodissection from 213 to 272, Phacoemulsification from 272 to 512"
GT: "Rhexis from 213 to 245, Hydrodissection from 245 to 293, Phacoemulsification from 293 to 512"
Metrics (mIoU, P@0.3, P@0.5, P@0.7, R@0.3, R@0.5, R@0.7): 0.4166666666666667, 1.0, 0.5, 0.5, 0.6666666666666666, 0.5, 0.5

There are 3 steps in the GT and prediction combined, namely Rhexis (s1), Hydrodissection (s2) and Phacoemulsification (s3).
The IoUs between the GT and predicted time ranges for each of them are:
s1: 0 (since Rhexis is not in the prediction)
s2: (272-245)/(293-213) = 27/80 = 0.34
s3: (512-293)/(512-272) = 219/240 = 0.91
Hence, the mIoU for this example is (0 + 0.34 + 0.91)/3 = 0.42.
For an IoU threshold of 0.3, there are 2 TP, 1 FN and 0 FP. Hence, P@0.3 is 2/(2+0) = 1.0 and R@0.3 is 2/(2+1) = 0.67. Repeating this for thresholds 0.5 and 0.7 and averaging gives the complete set of metrics for this example. These are further averaged across all data examples to get the values in Table 1.
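The arithmetic for this example can be checked with a short sketch. It assumes each step appears at most once in the prediction and in the GT, so no assignment step is needed, and it counts only GT steps absent from the prediction as FN, as described above. The function names are hypothetical and this is not the evaluation code used for the paper.

```python
def interval_iou(a, b):
    """IoU of two (start, end) time ranges."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def example_metrics(pred, gt, thresholds=(0.3, 0.5, 0.7)):
    """pred, gt: dicts mapping step name -> (start, end) for one example."""
    steps = set(pred) | set(gt)
    # Steps missing from either side get an IoU of 0.
    ious = {
        s: (interval_iou(pred[s], gt[s]) if s in pred and s in gt else 0.0)
        for s in steps
    }
    miou = sum(ious.values()) / len(steps)
    fn = sum(1 for s in gt if s not in pred)  # GT steps the model missed
    precision, recall = {}, {}
    for t in thresholds:
        tp = sum(1 for s in pred if s in gt and ious[s] > t)
        fp = len(pred) - tp  # predictions that are not true positives
        precision[t] = tp / (tp + fp) if tp + fp else 0.0
        recall[t] = tp / (tp + fn) if tp + fn else 0.0
    return miou, precision, recall

pred = {"Hydrodissection": (213, 272), "Phacoemulsification": (272, 512)}
gt = {
    "Rhexis": (213, 245),
    "Hydrodissection": (245, 293),
    "Phacoemulsification": (293, 512),
}
miou, precision, recall = example_metrics(pred, gt)
print(round(miou, 2), precision, recall)
```

Under these assumptions the sketch gives an mIoU of about 0.42, P@0.3 = 1.0 and R@0.3 of about 0.67 for the example.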

Thus, in summary, the mAP measures the ability of the VLM to predict the correct boundaries, while the mAR measures the ability of the VLM to detect all relevant steps. The mIoU denotes how temporally accurate the model is for the detected steps.


2) Confusion regarding False Negative
We sincerely apologize for the typo in Section 4.3, where the correct line should be: "The ground truths that were not matched with any prediction are considered as false negatives, while the predictions that were not true positives were considered as false positives". We have updated it in the revised paper in the supporting documents. We can assure the reviewer that this was purely a writing error and that the implementation and all reported results use the correct definitions. The example above shows the metrics for a case where the step missed by the prediction is counted as a false negative. The IoU is calculated using the Hungarian algorithm, which matches each prediction with the ground-truth range that gives the maximal IoU. We observed that our model does not produce multiple predictions for the same step and time range, since we do not use a Q-former based decoding approach, which suffers from this issue.

3) Key annotation details for Cat-1K
Working with an expert surgeon, we prespecified definitions for the start and stop of each step. All annotators annotated a common set of 50 videos to verify agreement among them, with a tolerance of 1 second for step boundaries. Each of the remaining videos was annotated by at least one annotator, and a random 10% sample was annotated by two different annotators to verify inter-annotator agreement (for all videos).

<Can delete >
Two annotators worked on ~100 videos each, while the rest of the videos were annotated by the third annotator. Finally, all annotations were evaluated by all annotators for deviations from the prespecified definitions.

For the remaining 250 videos, each video was annotated by at least one annotator
+ random 10% sample was annotated by two different annotators
Total = 4 different annotators
</Can delete>

<Swaroop's version>
We prespecified definitions for the start and stop of each step working with an expert surgeon. All annotators first annotated a common set of 50 videos. For the remaining videos, one person annotated them and at least two people annotated a random 10% sample of videos. We evaluated these annotations for deviations from the prespecified definitions and retrained the annotators when indicated. We used a tolerance of 1 second to evaluate agreement among annotators.
</Swaroop's version>

4) Comparability
In Table 2, Ablation 3 shows our pipeline with a single LoRA that handles all three tasks. This setup uses the VMAE features, the IM-TS Adapter and a single LoRA. Its performance is quite low because, without disentanglement between tasks, the model is not able to generate the correct output format. In Table 3, Ablation 1 uses multiple LoRAs but without the VMAE features or the IM-TS Adapter, and we see similar performance. In Table 3, Ablation 2 and Ablation 3 show that using multiple LoRAs together with one of these components gives a clear benefit over the above two cases. Finally, the last row incorporates all the model components and design choices, giving the maximum benefit.