Compositional Agentic Formulation Search for Open-Vocabulary Audio-Visual Event Localization
Keywords: OV-AVEL
Abstract: Open-vocabulary audio-visual event localization requires models to temporally localize events and recognize categories that may be unseen during training, exposing a tension between task-specific adaptation and preservation of pretrained multimodal alignment. We recast this problem as policy-aware formulation discovery: instead of fine-tuning pretrained encoders, we keep audio, visual, and text representations frozen and compose them with an agent-discovered symbolic decision formulation and a video-specific neural policy. Our method, Cue2Rule, uses a multi-agent loop to search over executable decision formulations, jointly guided by oracle expressiveness and policy learnability. It then trains a lightweight policy network to predict per-video parameters for the selected formulation, replacing globally fixed thresholds with adaptive parameterization. This decomposition separates reusable decision structure from sample-specific adaptation. On OV-AVEBench, Cue2Rule improves over the best fine-tuning baseline, raising the total average score from 57.8 to 60.2 and unseen-category performance from 55.8 to 59.9, while reducing the seen-unseen gap from 7.1 to 1.2. These results suggest that agent-discovered symbolic structure with neural per-sample instantiation provides an effective decision-layer adaptation strategy for OV-AVEL with frozen encoders.
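The split between a reusable decision formulation and per-video parameters can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from Cue2Rule: the rule (a weighted fusion of audio and visual similarity followed by a threshold), the names `fuse`, `localize`, and `toy_policy`, and the similarity values are all hypothetical. In particular, `toy_policy` is a hand-written stand-in for the learned policy network; it only shows where per-video parameters would enter.

```python
from typing import Dict, List

# Hypothetical per-segment similarities between frozen audio/visual embeddings
# and the text embedding of one candidate class. The values are made up.
Scores = List[float]


def fuse(audio: Scores, visual: Scores, alpha: float) -> Scores:
    """Assumed symbolic formulation: convex combination of the two cues."""
    return [alpha * a + (1.0 - alpha) * v for a, v in zip(audio, visual)]


def localize(fused: Scores, tau: float) -> List[int]:
    """Label a segment 1 (event present) when its fused score exceeds tau."""
    return [1 if s > tau else 0 for s in fused]


def toy_policy(audio: Scores, visual: Scores) -> Dict[str, float]:
    """Stand-in for a learned policy: derive parameters from video statistics.

    Here tau is simply the mean fused score of this video, so the threshold
    adapts to each video's overall score level.
    """
    alpha = 0.5
    fused = fuse(audio, visual, alpha)
    return {"alpha": alpha, "tau": sum(fused) / len(fused)}


# A video whose similarities are low overall but clearly higher in the
# last two segments.
audio = [0.10, 0.12, 0.30, 0.32]
visual = [0.14, 0.12, 0.34, 0.28]

# Globally fixed threshold: the same tau for every video.
fixed_labels = localize(fuse(audio, visual, 0.5), tau=0.5)

# Per-video parameters: the same rule, instantiated for this video.
params = toy_policy(audio, visual)
adaptive_labels = localize(fuse(audio, visual, params["alpha"]), params["tau"])

print(fixed_labels)     # [0, 0, 0, 0]
print(adaptive_labels)  # [0, 0, 1, 1]
```

In this toy case the fixed threshold misses the event entirely because the video's scores are uniformly low, while the per-video threshold recovers the last two segments. The rule itself is unchanged between the two calls; only its parameters differ.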
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 55