Keywords: Multi-modal, Large Language Models, Vision and Language
TL;DR: Multimodal LLMs excel at conversation, yet their real-time, interactive, step-by-step coaching of everyday skills remains untested. We introduce LiveCook to assess this and reveal poor performance.
Abstract: Current state-of-the-art multi-modal Large Language Models (LLMs) have advanced conversational abilities. However, their effectiveness as coaches for learning everyday skills by providing live, interactive, step-by-step guidance is still untested. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which must happen in real time. To evaluate such capabilities, we introduce LiveCook, a new benchmark and dataset built upon CaptainCook4D that features densely annotated, timed instructions and feedback messages, including mistake alerts precisely timestamped to their visual occurrence in the video. Extensive evaluation shows that current state-of-the-art multi-modal LLMs struggle to provide live, interactive, step-by-step guidance.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 38