Abstract: Embodied everyday tasks are a popular challenge in the embodied AI community, requiring agents to produce a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. First, natural language instructions often lack explicit task planning. Second, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Models (LLMs) either suffer from poor performance due to a lack of task-specific knowledge or rely on ground truth as few-shot samples. To address these limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground truth. In contrast to conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach that progressively updates the database. In each iteration, P-RAG queries the latest database and obtains historical information from previous interactions as experiential references for the current interaction. Moreover, we introduce a more granular retrieval scheme that retrieves not only similar tasks but also similar situations, providing more valuable reference experiences. Extensive experiments show that P-RAG achieves competitive results without utilizing ground truth and can further improve performance through self-iteration. We will release the source code to the public.
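To make the progressive retrieval loop concrete, the following is a minimal Python sketch of the iteration described above. All names here (`Database`, `Record`, `overlap`, `run_episode`, the `env`/`llm` interfaces) are illustrative assumptions, not the paper's actual implementation; a real system would use embedding-based similarity rather than word overlap.

```python
# Hypothetical sketch of a progressive RAG loop: retrieve -> act -> update.
# Interfaces for `env` and `llm` are assumed, not taken from the paper.
from dataclasses import dataclass, field

def overlap(a: str, b: str) -> float:
    # Toy similarity: Jaccard word overlap; stands in for embedding similarity.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

@dataclass
class Record:
    task: str          # natural-language instruction
    situation: str     # textual summary of an observation
    trajectory: list   # (action, observation) pairs from one interaction

@dataclass
class Database:
    records: list = field(default_factory=list)

    def retrieve(self, task: str, situation: str, k: int = 3) -> list:
        # Granular retrieval: score records by similarity to both the
        # task instruction and the current situation, not the task alone.
        def score(r: Record) -> float:
            return overlap(r.task, task) + overlap(r.situation, situation)
        return sorted(self.records, key=score, reverse=True)[:k]

def run_episode(env, llm, db: Database, task: str) -> list:
    # Assumed env interface: reset(task) -> obs, step(action) -> (obs, done).
    trajectory, obs, done = [], env.reset(task), False
    while not done:
        refs = db.retrieve(task, obs)      # experiential references
        action = llm(task, obs, refs)      # LLM proposes the next action
        obs, done = env.step(action)
        trajectory.append((action, obs))
    # Progressive update: store the interaction regardless of success,
    # so later iterations can retrieve it -- no ground truth required.
    db.records.append(Record(task, obs, trajectory))
    return trajectory
```

Running `run_episode` repeatedly over the task set grows the database, so each new iteration retrieves from a richer pool of past interactions; this is the self-iteration mechanism the abstract refers to.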
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: Our work falls within the domain of Multimedia Foundation Models. We propose a novel framework, Progressive Retrieval Augmented Generation (P-RAG), to solve embodied everyday tasks. P-RAG builds on an LLM to progressively enhance the agent's policy in embodied everyday tasks. During the interaction between P-RAG and the environment, the agent must analyze observations returned by the environment, such as images. P-RAG outperforms existing methods when utilizing few-shot training datasets, and it further provides an iterative approach that can enhance performance even more. We hope that P-RAG offers insight into applying LLMs to more specific and practical embodied AI tasks.
Supplementary Material: zip
Submission Number: 1418