Vid2Coach: Transforming How-To Videos into Task Assistants

The University of Texas at Austin
A figure shows an accessible cooking support system helping a blind user slice bell peppers. On the left, a How-To Video shows a person preparing food, and Accessible Resources highlight blind cooking tips. Arrows point to the center, where a blind user wearing Meta glasses and an apron slices yellow bell peppers on a cutting board. The user asks, “I’m not confident with knives. Any tips?” and “Does this look complete?” Three guidance sections are displayed: 1. Instructions & Demonstration Details: Slice bell peppers. In the video, the person slices yellow and red bell peppers into thin 1/4-inch-wide strips using a kitchen knife and wooden cutting board. 2. Accessible Tips & Workarounds: Use kitchen scissors to cut peppers directly over a tray or bowl, so you can easily find all the pieces with touch. Or, you can wear a cut-resistant glove. 3. Proactive Progress Feedback: You don’t seem to be done yet because there are still some larger yellow pepper pieces on the right side. Try feeling for any thicker slices and trimming them down so they match the thinner ones. Keep going, you’re almost there!

Vid2Coach is a system that transforms how-to videos into a wearable camera-based task assistant that provides accessible instructions and mixed-initiative feedback. Given a how-to video, Vid2Coach extracts high-level steps and demonstration details, then uses retrieval-augmented generation to supplement each step with BLV-specific guidelines. Vid2Coach then monitors user progress with the camera in smart glasses to provide proactive feedback.
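The pipeline above can be made concrete with a small per-step data model. The Python sketch below is illustrative only: the class and field names (Step, CompletionCriteria, accessible_tips, and so on) are assumptions for exposition, not the authors' implementation, and the example values paraphrase the bell-pepper step in the teaser figure.

from dataclasses import dataclass, field
from typing import List

@dataclass
class CompletionCriteria:
    # Criteria that later drive progress monitoring (see the System section).
    irrelevant: str    # e.g., "No bell peppers are visible on the cutting board"
    in_progress: str   # e.g., "Peppers are partially cut; large pieces remain"
    complete: str      # e.g., "All pieces are cut into thin, even strips"

@dataclass
class Step:
    instruction: str               # high-level step from the video narration
    demonstration_details: str     # tools, quantities, and actions shown on screen
    criteria: CompletionCriteria   # completion criteria generated from the video
    accessible_tips: List[str] = field(default_factory=list)  # filled in via RAG

# Example step, paraphrased from the figures on this page (hypothetical values).
slice_peppers = Step(
    instruction="Slice 2 bell peppers for toppings",
    demonstration_details=("The person slices one red and one yellow bell pepper "
                           "into thin strips with a chef's knife on a wooden "
                           "cutting board."),
    criteria=CompletionCriteria(
        irrelevant="No bell peppers are visible on the cutting board",
        in_progress="Peppers are partially cut; large pieces remain",
        complete="All pepper pieces are cut into thin, even strips",
    ),
)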

Abstract

People use videos to learn new recipes, exercises, and crafts. Such videos remain difficult for blind and low vision (BLV) people to follow as they rely on visual comparison. Our observations of visual rehabilitation therapists (VRTs) guiding BLV people to follow how-to videos revealed that VRTs provide both proactive and responsive support including detailed descriptions, non-visual workarounds, and progress feedback. We propose Vid2Coach, a system that transforms how-to videos into wearable camera-based assistants that provide accessible instructions and mixed-initiative feedback. From the video, Vid2Coach generates accessible instructions by augmenting narrated instructions with demonstration details and completion criteria for each step. It then uses retrieval-augmented generation to supplement non-visual workarounds from BLV-specific resources. Vid2Coach then monitors user progress with a camera embedded in commercial smart glasses to provide context-aware instructions, proactive feedback, and answers to user questions. BLV participants (N=8) using Vid2Coach completed cooking tasks with 58.5% fewer errors than when using their typical workflow and wanted to use Vid2Coach in their daily lives. Vid2Coach demonstrates an opportunity for AI visual assistance that strengthens rather than replaces non-visual expertise.

Observational Study

The image shows a formative study setup where a vision rehabilitation therapist (VRT) remotely monitors a blind participant (BLV) cooking through a live video stream from the participant’s smart glasses. On the left, the blind participant is cooking at a stovetop, while on the right, the VRT watches the video feed and provides guidance via a computer setup with dual monitors. Below, three completed dishes made by the participant are displayed: a tray of freshly baked chocolate chip cookies, a pavlova topped with whipped cream and fresh berries, and a plate of eggs Benedict featuring poached eggs on an English muffin with ham.

To inform our system design, we observed how VRTs deliver real-time remote guidance to BLV individuals following how-to videos, monitoring their progress through the smart glasses' video stream.





Design Goals


D1. Provide instructions based on both narration and visual demonstrations of how-to videos.

D2. Supplement instructions with accessible tips and workarounds.

D3. Provide proactive visual feedback on user progress.

D4. Encourage users to leverage non-visual sensory cues to evaluate progress.

D5. Address users’ diverse questions with responses grounded in users’ task progress and how-to video knowledge.

D6. Adapt instructions and feedback to user preference, skills, and context.




System

Diagram showing how Vid2Coach generates accessible cooking instructions from a how-to video using multimodal understanding and retrieval-augmented generation (RAG). On the left, frames from a video show someone preparing bell peppers, with narration: Now, I’m preparing the bell pepper for toppings. Arrows labeled A (Multimodal Understanding) and B (Multimodal RAG) lead into a flowchart. The system first generates a High-level Instruction: Slice 2 bell peppers for toppings. Then it adds Demonstration Details: The person is slicing one red and one yellow bell pepper using a sharp chef’s knife on a sturdy wooden cutting board. After cutting them into 2-inch pieces, they place them on a paper plate with herbs and olives. Finally, it provides Tips and Workarounds based on user needs. For low vision: use a high-contrast cutting board. For blind users: use a plunge chopper or kitchen scissors. Accessible resources like guidelines and videos are shown on the right as input to the RAG module.

Vid2Coach generates step instructions from a how-to video with multimodal understanding of narration and frames (A), and supplements them with tips and workarounds retrieved from accessible task resources using RAG (B).
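As a rough illustration of the retrieval step (B), the sketch below ranks snippets from a hypothetical corpus of BLV cooking resources against a step description using TF-IDF similarity. The corpus contents and the retriever here are stand-ins for exposition; the actual Vid2Coach retrieval pipeline and embedding model are not reproduced.

from typing import List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical snippets standing in for accessible cooking guidelines and videos.
blv_resources = [
    "Use kitchen scissors to cut vegetables directly over a bowl so pieces are easy to find by touch.",
    "Wear a cut-resistant glove when using a knife.",
    "Use a high-contrast cutting board if you have low vision.",
    "Tuck your fingertips under and guide the knife blade along your knuckles.",
]

def retrieve_tips(step_text: str, resources: List[str], k: int = 2) -> List[str]:
    """Return the k resource snippets most similar to the step description."""
    vectorizer = TfidfVectorizer().fit(resources + [step_text])
    doc_vecs = vectorizer.transform(resources)
    query_vec = vectorizer.transform([step_text])
    scores = cosine_similarity(query_vec, doc_vecs)[0]
    top_indices = scores.argsort()[::-1][:k]
    return [resources[i] for i in top_indices]

print(retrieve_tips("Slice 2 bell peppers into thin strips with a chef's knife",
                    blv_resources))

In the full system, retrieved snippets would then be adapted to the user's needs (e.g., blind vs. low vision) before being read out alongside the step instruction.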





Side-by-side comparison of Vid2Coach and state-of-the-art VLMs (LLaVA-OV, Gemini, GPT-4o) on two cooking video segments, with screenshots and corresponding descriptions. On the left, a chef adds tarragon vinegar to egg yolks. The narration briefly says: Tarragon vinegar, pop that into the eggs. Vid2Coach provides a detailed task-relevant description including the tool used and substitution advice. LLaVA-OV gives a vague overview, highlighting background elements and adding hallucinated details, which are marked in orange and red. On the right, a chef places bacon on a breakfast dish. The narration simply states: I think that is the perfect breakfast. Vid2Coach describes precise hand actions and the visual result, while Gemini and GPT-4o offer less detailed summaries, adding hallucinated content about emotions and declarations. The figure emphasizes how Vid2Coach adds new task-relevant details (blue) and avoids hallucinations (red) or generic content (orange).

Qualitative comparisons of Vid2Coach descriptions with those of SOTA VLMs on two action sequences. The VLM descriptions often include hallucinations (red) and generic, less task-relevant content (orange). Vid2Coach captures new task-relevant details not covered in the narration (blue).





Diagram illustrating how a system provides progress feedback for a durative cooking action from a how-to video: Melt the butter on medium-low heat until it is brown. The top row shows video frames of butter gradually melting and browning in a pan. Below, three labeled completion criteria define stages: Irrelevant – No butter is visible in the pan; In-Progress – Butter is visible in solid form or liquefying as it melts in bright yellow color; and Complete – The butter is a deep golden brown color. A user’s video stream is shown progressing from turning on the stove to melting butter, with color-coded markers matching the stages. The system offers proactive feedback such as: Great, the butter is melting and is bubbling; and later: You seem to be complete because the butter looks golden brown.

From the how-to video, Vid2Coach generates criteria for classifying user status as irrelevant, in-progress, or complete. As the user performs the task, Vid2Coach monitors their progress and provides real-time feedback.
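A simplified version of this monitoring loop might look like the following. The classify_frame and speak callbacks are hypothetical placeholders (for example, a vision-language model prompted with the step's completion criteria, and a text-to-speech call); the real system's prompts, models, and polling strategy may differ.

import time
from typing import Callable

STATUSES = ("irrelevant", "in-progress", "complete")

def monitor_step(get_frame: Callable[[], bytes],
                 classify_frame: Callable[[bytes], str],
                 speak: Callable[[str], None],
                 poll_seconds: float = 5.0) -> None:
    """Poll the smart-glasses camera and announce status transitions for one step."""
    last_status = "irrelevant"
    while last_status != "complete":
        frame = get_frame()
        status = classify_frame(frame)      # expected to return one of STATUSES
        if status != last_status:           # proactive feedback only on change
            if status == "in-progress":
                speak("Great, you've started this step. Keep going!")
            elif status == "complete":
                speak("You seem to be complete; this step looks done.")
            last_status = status
        time.sleep(poll_seconds)

Announcing feedback only on status transitions is one way to keep proactive feedback from becoming repetitive while the user works; the actual feedback timing used by Vid2Coach may differ.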




Evaluation



In the baseline condition, participants used their phones to listen to the video instructions (A) or called visual interpreters to get feedback (B). In the Vid2Coach condition, participants wore Meta glasses and used free-form speech to interact with the system.





The figure shows step completion (left) and user-initiated interactions (right) for each participant across two tasks, comparing the Vid2Coach condition to the baseline condition. Each row represents a participant and each column a step in the task. The left panel visualizes whether each step was completed correctly, in the wrong order, omitted, or skipped due to technical errors. The right panel shows the type and frequency of user-initiated interactions during the task, such as navigation questions, progress checks, and external help. Overall, participants using Vid2Coach completed more steps correctly and interacted more frequently with the system, while baseline participants showed more errors, omissions, and reliance on human or AI agents.

Step completion (left) and user-initiated interactions (right) visualized across participants. Each row represents a participant, and each column a step in the task. For Vid2Coach, we include only user-initiated interactions, not Vid2Coach’s continuous proactive feedback, for brevity.





Bar chart comparing user rating distributions between Baseline and Vid2Coach conditions across multiple measures, with ratings ranging from 1 (negative) to 7 (positive). Measures include helpfulness of instruction and feedback, knowledge gain, ease of navigation, mental, physical, and temporal demand, frustration, effort, and performance. Statistically significant differences are marked: Vid2Coach significantly outperformed the baseline in 8 out of 10 categories, including mental demand, frustration, effort, instruction helpfulness, ease of navigation, and performance. (marked with * for p < 0.05 and ** for p < 0.01).

Distribution of the rating scores for the Baseline and Vid2Coach (1 = negative, 7 = positive). The asterisks indicate statistical significance according to the Wilcoxon test (p < 0.05 is marked with * and p < 0.01 with **).



Demo Video

BibTeX

@article{huh2025vid2coach,
  title={Vid2Coach: Transforming How-To Videos into Task Assistants},
  author={Huh, Mina and Xue, Zihui and Das, Ujjaini and Ashutosh, Kumar and Grauman, Kristen and Pavel, Amy},
  journal={arXiv preprint arXiv:2506.00717},
  year={2025}
}