Abstract: In the domain of robotics, achieving Lifelong Open-ended Learning Autonomy (LOLA) represents a significant milestone, especially in contexts where autonomous agents must adapt to unforeseen environmental variations and evolving objectives. This paper introduces VISOR (VisionSimilarity for Open-ended Robotic exploration), a vision-based framework designed to assist robotic agents in autonomously exploring and learning from new environments and objects, whether through guided or random exploration, without reliance on predefined design considerations. In that direction, VISOR acts as a perception mediator, classifying everything a robot encounters in a scene as either known or unknown. It further identifies potential distractors (e.g., background elements), known categories, or objects specified through text seeds. By leveraging recent advancements in vision foundation models, VISOR operates in a training-free manner. It begins by segmenting a scene into its constituent entities, regardless of familiarity, and then extracts robust visual representations for each one. These representations are compared against an adaptive memory system that evolves over time; unknown objects are assigned unique IDs and added to this memory as new classes, enriching the robot's understanding of its environment. We argue that this evolving memory can facilitate guided exploration through prior knowledge, enhancing the efficiency of robotic exploration, and validate this by designing two exploration scenarios and running both simulated and real-world experiments.
Loading