1) KNN visualization
We have included a file named pca.png in the Supporting Documents, in which we use Principal Component Analysis (PCA) to illustrate how the KNN separates question types in the embedding space. The figure shows a clear separation between VC questions, which do not contain a step name (type 2), and MR and Counting questions, which do (types 1 and 3 in the figure). The separation between the latter two types is comparatively thinner, which accounts for the 3% classification error.
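
For readers who wish to reproduce a plot of this kind, the projection step can be sketched as follows. This is a minimal sketch under stated assumptions, not the script used to produce pca.png: the embeddings are synthetic stand-ins (three Gaussian clusters in 16 dimensions), and the function name pca_project and all numeric values are hypothetical.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto their top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Synthetic stand-ins for question embeddings (assumption, not real data):
# type 2 lies far from types 1 and 3, which lie close to each other.
rng = np.random.default_rng(0)
dim, n = 16, 50
centers = {t: np.zeros(dim) for t in (1, 2, 3)}
centers[1][1] = 1.0
centers[3][1] = -1.0
centers[2][0] = 5.0

labels = np.repeat([1, 2, 3], n)
X = np.vstack([centers[t] + 0.3 * rng.standard_normal((n, dim)) for t in (1, 2, 3)])

Z = pca_project(X)  # shape (150, 2), ready for a scatter plot
centroids = {t: Z[labels == t].mean(axis=0) for t in (1, 2, 3)}
gap_2_vs_1 = np.linalg.norm(centroids[2] - centroids[1])
gap_1_vs_3 = np.linalg.norm(centroids[1] - centroids[3])
```

In this synthetic setting, the 2D centroid gap between type 2 and type 1 is larger than the gap between types 1 and 3, mirroring the qualitative pattern described above.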

It is important to note that KNN is a very simple method that facilitates automatic LoRA selection. For most practical purposes, however, we believe it is easy to select the LoRA manually, which removes the need for any classifier. We agree that better methods exist that are more robust to adversarial attacks. Kindly note, however, that we introduce the KNN classifier as a proof of concept; it is not the main contribution of the paper. As noted by the reviewer, our main goal is to enhance the temporal awareness of the VLM.

2) KNN scalability
To generate the training data for the KNN, we prompted ChatGPT to generate N synonymous, non-repeating sentences, given the sentences used to train CatVLM. These generated sentences serve as the training set for the KNN. Here, we used N=50, but given the current power of LLMs, this number can be increased and the procedure extended to more tasks. Another approach is to ask a large pretrained LLM to classify the question type among the existing tasks instead of training a dedicated classifier. This approach is training-free and generalizes to most tasks without any data generation. Questions can be processed in batches to increase scalability and reduce the amortized cost of using the pretrained LLM. We explored KNN as a simple proof of concept to show that LoRA selection can be automated. However, we believe that manual selection of the LoRA is a very feasible alternative.
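
The KNN-based routing described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our actual implementation: a toy bag-of-words vector stands in for the sentence embedding, and the example questions, the adapter names, and the function names (embed, classify, select_lora) are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical paraphrased questions with their question types (not real data).
TRAIN = [
    ("what is happening in the video right now", "VC"),
    ("which step is being performed at this moment", "VC"),
    ("what is the surgeon doing currently", "VC"),
    ("how many times has the incision step occurred so far", "Counting"),
    ("how many times was the suturing step performed", "Counting"),
    ("count how many times the irrigation step happened", "Counting"),
    ("when did the incision step last occur", "MR"),
    ("how long ago did the suturing step end", "MR"),
    ("when was the irrigation step last performed", "MR"),
]

# Hypothetical adapter names, one per question type.
LORA_FOR_TYPE = {"VC": "lora_vc", "Counting": "lora_counting", "MR": "lora_mr"}

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(question, k=3):
    """Majority vote over the k most similar training questions."""
    q = embed(question)
    ranked = sorted(TRAIN, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def select_lora(question):
    """Route a question to the adapter registered for its predicted type."""
    return LORA_FOR_TYPE[classify(question)]
```

Scaling to more tasks in this sketch amounts to appending labeled paraphrases to TRAIN and adding one entry to LORA_FOR_TYPE; no retraining step is involved because KNN only stores its examples.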

3) Practical Advantages of VLMs over CNN/RNNs
The literature on temporally aware LLMs in the medical domain is still sparse, and we hope our work encourages further research that also surpasses CNN/RNN performance. Even in its current state, however, we believe VLMs offer the following practical advantages:
(i) Handling open-set questions: Users can ask CatVLM questions directly in natural language. CNN/RNN-based methods would require converting the input to a fixed structure and vocabulary.
(ii) Easy scalability to more tasks: In CatVLM, adding new tasks is as simple as adding a new LoRA module. This makes it possible to scale it using continual learning and collaborative efforts where researchers can add their LoRA modules for different tasks to a common repository. On the other hand, CNN/RNN approaches typically require designing and training an entirely new architecture for each task.
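
To make point (ii) concrete, a shared repository of task-specific LoRA modules could be organized along the following lines. This is only a sketch under stated assumptions: the class AdapterRegistry, its methods, and the placeholder weight paths are hypothetical and do not describe CatVLM's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class AdapterRegistry:
    """Maps a task name to the location of its LoRA weights (hypothetical layout)."""
    adapters: dict = field(default_factory=dict)

    def register(self, task, weights_path):
        # Adding a task touches only the registry, never the base model.
        if task in self.adapters:
            raise ValueError(f"task {task!r} is already registered")
        self.adapters[task] = weights_path

    def select(self, task):
        # Raises KeyError for unknown tasks so routing mistakes surface early.
        return self.adapters[task]

registry = AdapterRegistry()
# Placeholder paths for illustration only.
registry.register("VC", "adapters/vc")
registry.register("MR", "adapters/mr")
registry.register("Counting", "adapters/counting")
# A collaborator contributes a new task without modifying existing entries.
registry.register("new_task", "adapters/new_task")
```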

We agree that more complex clinical questions are a promising next step, and CatVLM’s architecture (task‑specific LoRAs + timestamp‑aware features) is explicitly designed to support such extensions.