Abstract: In this paper, we present a new version of our interactive video retrieval system, V-FIRST. In addition to the existing features of querying by textual descriptions and visual examples, we propose using a text-to-image generator to produce images from a text prompt as a means of bridging the domain gap between textual queries and visual content. We also include a novel referring expression segmentation module that highlights the objects in an image referred to by the query. This is a first step towards providing adequate explainability for retrieval results, ensuring that the system can be trusted in domain-specific and critical scenarios. Querying by a sequence of events is another new addition, as temporal ordering proves pivotal when recalling events from memory. Furthermore, we have improved our Optical Character Recognition capability, especially for scene text. Finally, the inclusion of relevance feedback allows the user to explicitly refine the search space. Combined, these additions greatly improve user interaction, leveraging more explicit information and providing the user with more tools to work with.
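The query-by-generated-image idea mentioned above can be sketched as a small pipeline: generate an image from the text prompt, embed it with the same image encoder used to index the video keyframes, and rank keyframes by embedding similarity. The sketch below is illustrative only and is not V-FIRST's actual implementation; the generator and encoder are replaced by deterministic toy stand-ins, and all function names are assumptions.

```python
import zlib
import numpy as np

def embed(item_id: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for an image encoder (e.g. a CLIP-style model):
    maps an identifier to a deterministic unit vector."""
    rng = np.random.default_rng(zlib.crc32(item_id.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def generate_image(prompt: str) -> str:
    """Toy stand-in for a text-to-image generator; a real system
    would return pixels, here we return an identifier."""
    return f"generated::{prompt}"

def retrieve(prompt: str, keyframes: list[str], k: int = 3) -> list[str]:
    """Bridge the domain gap: embed the *generated* image instead of the
    raw text, then rank keyframes by cosine similarity (dot product of
    unit vectors) to that query embedding."""
    query_vec = embed(generate_image(prompt))
    return sorted(keyframes, key=lambda f: -float(query_vec @ embed(f)))[:k]

frames = [f"frame_{i}" for i in range(10)]
top = retrieve("a red car on a beach", frames, k=3)
print(top)
```

In a real system the ranking step would run against a precomputed index of keyframe embeddings rather than embedding each frame per query; the point here is only the image-to-image comparison that replaces a direct text-to-image comparison.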