PROCED-MEM: BENCHMARKING PROCEDURAL MEMORY RETRIEVAL IN LANGUAGE AGENTS ACROSS DOMAINS

Published: 03 Mar 2026, Last Modified: 25 Apr 2026 · ICLR 2026 Workshop MemAgents Poster · CC BY 4.0
Keywords: procedural memory, memory retrieval, language agents, benchmark, embedding evaluation, multimodal retrieval, ALFWorld, OSWorld, distribution shift, mean pooling
TL;DR: We benchmark procedural memory retrieval across text and GUI agent domains, uncovering a generalization cliff (30–42% MAP drop) and a visual similarity trap where screenshots help coarse retrieval but fail at fine-grained procedural matching.
Abstract: We introduce Proced-Mem, a benchmark for procedural memory retrieval in language agents with two sub-domains: text-based household tasks (ALFWorld) and real computer environments (OSWorld). Evaluating retrieval independently of downstream execution is critical because current agent evaluations conflate retrieval with planning and execution, masking whether agents retrieve relevant procedures or succeed despite poor memory access. Proced-Mem evaluates up to seven methods across text, visual, and lexical modalities, using an LLM-as-judge protocol for ALFWorld and a leave-one-out protocol with hierarchical ground truth at two granularity levels for OSWorld. Across both sub-domains, we find a generalization cliff (30–42% MAP degradation on novel contexts) and a granularity-method reversal where visual features rank first at coarse retrieval but last at fine-grained procedural matching. Proced-Mem provides the first diagnostic framework for identifying such failure modes, enabling the principled design of retrieval systems that generalize across granularity levels and modalities.
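The abstract reports retrieval quality as MAP (mean average precision) under a leave-one-out protocol, where each held-out task queries the remaining procedure memory. As a minimal sketch of how that metric is computed (the toy ranked lists and relevance sets below are hypothetical, not from the benchmark):

```python
def average_precision(ranked, relevant):
    """Average precision for one query: mean of precision@k at each relevant hit."""
    hits, score = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / k  # precision at this cutoff
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP over all queries; each entry is (ranked procedure ids, relevant id set).
    Under leave-one-out, each query is a held-out task and the ranked list is
    drawn from the remaining memory entries."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)
```

A 30–42% drop in this score on novel contexts, as the abstract reports, means relevant procedures slide far down the ranking when the query distribution shifts.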
Submission Number: 85