Vid2Coach: Transforming How-To Videos into Task Assistants

The University of Texas at Austin
A figure shows an accessible cooking support system helping a blind user slice bell peppers. On the left, a How-To Video shows a person preparing food, and Accessible Resources highlight blind cooking tips. Arrows point to the center, where a blind user wearing Meta glasses and an apron slices yellow bell peppers on a cutting board. The user asks, “I’m not confident with knives. Any tips?” and “Does this look complete?” Three guidance sections are displayed: 1. Instructions & Demonstration Details: Slice bell peppers. In the video, the person slices yellow and red bell peppers into thin 1/4-inch-wide strips using a kitchen knife and wooden cutting board. 2. Accessible Tips & Workarounds: Use kitchen scissors to cut peppers directly over a tray or bowl, so you can easily find all the pieces with touch. Or, you can wear a cut-resistant glove. 3. Proactive Progress Feedback: You don’t seem to be done yet because there are still some larger yellow pepper pieces on the right side. Try feeling for any thicker slices and trimming them down so they match the thinner ones. Keep going, you’re almost there!

Vid2Coach is a system that transforms how-to videos into a wearable camera-based task assistant that provides accessible instructions and mixed-initiative feedback. Given a how-to video, Vid2Coach extracts high-level steps and demonstration details, then uses retrieval-augmented generation to supplement each step with BLV-specific guidelines. Vid2Coach then monitors user progress with the camera in smart glasses to provide proactive feedback.
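The pipeline above can be made concrete with a small per-step data model. The Python sketch below is illustrative only: the class and field names (Step, CompletionCriteria, accessible_tips, and so on) are assumptions for exposition, not the authors' implementation, and the example values paraphrase the bell-pepper step in the teaser figure.

from dataclasses import dataclass, field
from typing import List

@dataclass
class CompletionCriteria:
    # Criteria that later drive progress monitoring (see the System section).
    irrelevant: str    # e.g., "No bell peppers are visible on the cutting board"
    in_progress: str   # e.g., "Peppers are partially cut; large pieces remain"
    complete: str      # e.g., "All pieces are cut into thin, even strips"

@dataclass
class Step:
    instruction: str               # high-level step from the video narration
    demonstration_details: str     # tools, quantities, and actions shown on screen
    criteria: CompletionCriteria   # completion criteria generated from the video
    accessible_tips: List[str] = field(default_factory=list)  # filled in via RAG

# Example step, paraphrased from the figures on this page (hypothetical values).
slice_peppers = Step(
    instruction="Slice 2 bell peppers for toppings",
    demonstration_details=("The person slices one red and one yellow bell pepper "
                           "into thin strips with a chef's knife on a wooden "
                           "cutting board."),
    criteria=CompletionCriteria(
        irrelevant="No bell peppers are visible on the cutting board",
        in_progress="Peppers are partially cut; large pieces remain",
        complete="All pepper pieces are cut into thin, even strips",
    ),
)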

Abstract

People use videos to learn new recipes, exercises, and crafts. Such videos remain difficult for blind and low vision (BLV) people to follow as they rely on visual comparison. Our observations of visual rehabilitation therapists (VRTs) guiding BLV people to follow how-to videos revealed that VRTs provide both proactive and responsive support including detailed descriptions, non-visual workarounds, and progress feedback. We propose Vid2Coach, a system that transforms how-to videos into wearable camera-based assistants that provide accessible instructions and mixed-initiative feedback. From the video, Vid2Coach generates accessible instructions by augmenting narrated instructions with demonstration details and completion criteria for each step. It then uses retrieval-augmented generation to supplement non-visual workarounds from BLV-specific resources. Vid2Coach then monitors user progress with a camera embedded in commercial smart glasses to provide context-aware instructions, proactive feedback, and answers to user questions. BLV participants (N=8) using Vid2Coach completed cooking tasks with 58.5% fewer errors than when using their typical workflow and wanted to use Vid2Coach in their daily lives. Vid2Coach demonstrates an opportunity for AI visual assistance that strengthens rather than replaces non-visual expertise.

Observational Study

The image shows a formative study setup where a vision rehabilitation therapist (VRT) remotely monitors a blind participant (BLV) cooking through a live video stream from the participant’s smart glasses. On the left, the blind participant is cooking at a stovetop, while on the right, the VRT watches the video feed and provides guidance via a computer setup with dual monitors. Below, three completed dishes made by the participant are displayed: a tray of freshly baked chocolate chip cookies, a pavlova topped with whipped cream and fresh berries, and a plate of eggs Benedict featuring poached eggs on an English muffin with ham.

To inform our system design, we observed how VRTs deliver real-time remote guidance to BLV individuals following how-to videos, monitoring their progress through the smart glasses' video stream.





Design Goals


D1. Provide instructions based on both narration and visual demonstrations of how-to videos.

D2. Supplement instructions with accessible tips and workarounds.

D3. Provide proactive visual feedback on user progress.

D4. Encourage users to leverage non-visual sensory cues to evaluate progress.

D5. Address users’ diverse questions with responses grounded in users’ task progress and how-to video knowledge.

D6. Adapt instructions and feedback to user preference, skills, and context.




System

Diagram showing how Vid2Coach generates accessible cooking instructions from a how-to video using multimodal understanding and retrieval-augmented generation (RAG). On the left, frames from a video show someone preparing bell peppers, with narration: Now, I’m preparing the bell pepper for toppings. Arrows labeled A (Multimodal Understanding) and B (Multimodal RAG) lead into a flowchart. The system first generates a High-level Instruction: Slice 2 bell peppers for toppings. Then it adds Demonstration Details: The person is slicing one red and one yellow bell pepper using a sharp chef’s knife on a sturdy wooden cutting board. After cutting them into 2-inch pieces, they place them on a paper plate with herbs and olives. Finally, it provides Tips and Workarounds based on user needs. For low vision: use a high-contrast cutting board. For blind users: use a plunge chopper or kitchen scissors. Accessible resources like guidelines and videos are shown on the right as input to the RAG module.

Vid2Coach generates step instructions from a how-to video with multimodal understanding of narration and frames (A), and supplements them with tips and workarounds retrieved from accessible task resources using RAG (B).
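As a rough illustration of the retrieval step (B), the sketch below ranks snippets from a hypothetical corpus of BLV cooking resources against a step description using TF-IDF similarity. The corpus contents and the retriever here are stand-ins for exposition; the actual Vid2Coach retrieval pipeline and embedding model are not reproduced.

from typing import List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical snippets standing in for accessible cooking guidelines and videos.
blv_resources = [
    "Use kitchen scissors to cut vegetables directly over a bowl so pieces are easy to find by touch.",
    "Wear a cut-resistant glove when using a knife.",
    "Use a high-contrast cutting board if you have low vision.",
    "Tuck your fingertips under and guide the knife blade along your knuckles.",
]

def retrieve_tips(step_text: str, resources: List[str], k: int = 2) -> List[str]:
    """Return the k resource snippets most similar to the step description."""
    vectorizer = TfidfVectorizer().fit(resources + [step_text])
    doc_vecs = vectorizer.transform(resources)
    query_vec = vectorizer.transform([step_text])
    scores = cosine_similarity(query_vec, doc_vecs)[0]
    top_indices = scores.argsort()[::-1][:k]
    return [resources[i] for i in top_indices]

print(retrieve_tips("Slice 2 bell peppers into thin strips with a chef's knife",
                    blv_resources))

In the full system, retrieved snippets would then be adapted to the user's needs (e.g., blind vs. low vision) before being read out alongside the step instruction.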





Side-by-side comparison of Vid2Coach and state-of-the-art VLMs (LLaVA-OV, Gemini, GPT-4o) on two cooking video segments, with screenshots and corresponding descriptions. On the left, a chef adds tarragon vinegar to egg yolks. The narration briefly says: Tarragon vinegar, pop that into the eggs. Vid2Coach provides a detailed task-relevant description including the tool used and substitution advice. LLaVA-OV gives a vague overview, highlighting background elements and adding hallucinated details, which are marked in orange and red. On the right, a chef places bacon on a breakfast dish. The narration simply states: I think that is the perfect breakfast. Vid2Coach describes precise hand actions and the visual result, while Gemini and GPT-4o offer less detailed summaries, adding hallucinated content about emotions and declarations. The figure emphasizes how Vid2Coach adds new task-relevant details (blue) and avoids hallucinations (red) or generic content (orange).

Qualitative comparisons of Vid2Coach descriptions with those of SOTA VLMs on two action sequences. The VLM descriptions often include hallucinations (red) and generic, less task-relevant content (orange). Vid2Coach captures new task-relevant details not covered in the narration (blue).





Diagram illustrating how a system provides progress feedback for a durative cooking action from a how-to video: Melt the butter on medium-low heat until it is brown. The top row shows video frames of butter gradually melting and browning in a pan. Below, three labeled completion criteria define stages: Irrelevant – No butter is visible in the pan; In-Progress – Butter is visible in solid form or liquefying as it melts in bright yellow color; and Complete – The butter is a deep golden brown color. A user’s video stream is shown progressing from turning on the stove to melting butter, with color-coded markers matching the stages. The system offers proactive feedback such as: Great, the butter is melting and is bubbling; and later: You seem to be complete because the butter looks golden brown.

From the how-to video, Vid2Coach generates criteria for classifying user status as irrelevant, in-progress, or complete. As the user performs the task, Vid2Coach monitors their progress and provides real-time feedback.
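A simplified version of this monitoring loop might look like the following. The classify_frame and speak callbacks are hypothetical placeholders (for example, a vision-language model prompted with the step's completion criteria, and a text-to-speech call); the real system's prompts, models, and polling strategy may differ.

import time
from typing import Callable

STATUSES = ("irrelevant", "in-progress", "complete")

def monitor_step(get_frame: Callable[[], bytes],
                 classify_frame: Callable[[bytes], str],
                 speak: Callable[[str], None],
                 poll_seconds: float = 5.0) -> None:
    """Poll the smart-glasses camera and announce status transitions for one step."""
    last_status = "irrelevant"
    while last_status != "complete":
        frame = get_frame()
        status = classify_frame(frame)      # expected to return one of STATUSES
        if status != last_status:           # proactive feedback only on change
            if status == "in-progress":
                speak("Great, you've started this step. Keep going!")
            elif status == "complete":
                speak("You seem to be complete; this step looks done.")
            last_status = status
        time.sleep(poll_seconds)

Announcing feedback only on status transitions is one way to keep proactive feedback from becoming repetitive while the user works; the actual feedback timing used by Vid2Coach may differ.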




Evaluation



In the baseline condition, participants used their phones to listen to the video instructions (A) or called visual interpreters to get feedback (B). In the Vid2Coach condition, participants wore Meta glasses and used free-form speech to interact with the system.





The figure shows step completion (left) and user-initiated interactions (right) for each participant across two tasks, comparing the Vid2Coach condition to the baseline condition. Each row represents a participant and each column a step in the task. The left panel visualizes whether each step was completed correctly, in the wrong order, omitted, or skipped due to technical errors. The right panel shows the type and frequency of user-initiated interactions during the task, such as navigation questions, progress checks, and external help. Overall, participants using Vid2Coach completed more steps correctly and interacted more frequently with the system, while baseline participants showed more errors, omissions, and reliance on human or AI agents.

Step completion (left) and user-initiated interactions (right) visualized across participants. Each row represents a participant, and each column a step in the task. For Vid2Coach, we include only user-initiated interactions, not Vid2Coach’s continuous proactive feedback, for brevity.





Bar chart comparing user rating distributions between Baseline and Vid2Coach conditions across multiple measures, with ratings ranging from 1 (negative) to 7 (positive). Measures include helpfulness of instruction and feedback, knowledge gain, ease of navigation, mental, physical, and temporal demand, frustration, effort, and performance. Statistically significant differences are marked: Vid2Coach significantly outperformed the baseline in 8 out of 10 categories, including mental demand, frustration, effort, instruction helpfulness, ease of navigation, and performance. (marked with * for p < 0.05 and ** for p < 0.01).

Distribution of the rating scores for the Baseline and Vid2Coach (1 = negative, 7 = positive). The asterisks indicate statistical significance according to the Wilcoxon test (p < 0.05 is marked with * and p < 0.01 with **).



Demo Video

BibTeX

@article{huh2025vid2coach,
  title={Vid2Coach: Transforming How-To Videos into Task Assistants},
  author={Huh, Mina and Xue, Zihui and Das, Ujjaini and Ashutosh, Kumar and Grauman, Kristen and Pavel, Amy},
  journal={arXiv preprint arXiv:2506.00717},
  year={2025}
}