Connect the Dots: Zero-Shot Step Detection with Foundation Models

ACL ARR 2025 February Submission4402 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · License: CC BY 4.0
Abstract: When performing a task such as making a cup of coffee or replacing a bicycle tyre, an individual or 'User' might seek further guidance to ensure that the task is completed correctly. Foundation models are suitable candidates for providing this guidance automatically. However, a model must first be able to grasp a given situation before it can offer situated guidance. This work focuses on 'Step Detection' (SD), where a model must detect which step of a task a User is performing, given a dialogue history and an image of the current scene. We leverage open-access language and vision-language foundation models to perform zero-shot SD on the Watch, Talk and Guide benchmark. We show that current publicly available models achieve up to 54.40 F1, outperforming ChatGPT-3.5 by 12%. To enhance the performance of vision-language models (VLMs) on SD, we propose applying 'structured Chain of Thought' (CoT), which guides the model through a multi-turn interaction that steers it towards the correct answer. We demonstrate that structured CoT can lead to significant improvements when scene images are clear and relevant, and that leveraging predictions from an image classifier trained on in-domain data yields further performance gains.
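
To make the structured-CoT idea concrete, below is a minimal, hypothetical Python sketch of how such a multi-turn interaction could be scripted around an open-access VLM. The ask_vlm callable, the message format, the prompts, and the fallback logic are assumptions for illustration only and are not the paper's actual implementation.

# Minimal sketch of zero-shot Step Detection (SD) via a structured,
# multi-turn Chain-of-Thought style interaction with a vision-language model.
# `ask_vlm` is a hypothetical chat interface (e.g. a wrapper around an
# open-access VLM); prompts, message format, and steps are illustrative.

from typing import Callable, List

def detect_step(
    ask_vlm: Callable[[List[dict]], str],  # hypothetical: messages -> reply text
    scene_image_path: str,
    dialogue_history: List[str],
    recipe_steps: List[str],
) -> str:
    """Return the task step the User is most likely performing."""
    messages = [
        {"role": "system",
         "content": "You are an assistant guiding a User through a task."},
        # Turn 1: ground the model in the scene before asking for a decision.
        {"role": "user",
         "content": [
             {"type": "image", "path": scene_image_path},
             {"type": "text",
              "text": "Describe the objects and the User's actions visible in this image."},
         ]},
    ]
    scene_description = ask_vlm(messages)
    messages.append({"role": "assistant", "content": scene_description})

    # Turn 2: combine the scene description with the dialogue history
    # and ask the model to pick one step from the known step list.
    step_list = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(recipe_steps))
    messages.append({
        "role": "user",
        "content": (
            "Dialogue so far:\n" + "\n".join(dialogue_history) + "\n\n"
            "Task steps:\n" + step_list + "\n\n"
            "Based on the image description and the dialogue, which step is the "
            "User performing right now? Answer with the step number only."
        ),
    })
    answer = ask_vlm(messages).strip()

    # Map the reply back to a step, falling back to the first step on parse failure.
    digits = "".join(ch for ch in answer if ch.isdigit())
    idx = int(digits) - 1 if digits else 0
    return recipe_steps[min(max(idx, 0), len(recipe_steps) - 1)]

Predictions from an in-domain image classifier, as mentioned in the abstract, could be injected as an additional text turn before the final question; that variant is omitted here for brevity.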
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal application, multimodality, vision question answering
Contribution Types: NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 4402