Making Short-Form Videos Accessible
with Hierarchical Video Summaries

UT Austin, Cornell University
ACM CHI 2024
The figure shows information from left to right. On the far left is the original video, represented as a single frame of a woman holding a bowl of salad with the text “Let's make the salad Jennifer Aniston ate every day on the set of Friends.” Two arrows point to the right from this picture. One points to the keyframes, a stack of single frames showing ingredients being added to a bowl. The other arrow points to the transcript of the video (transcribed with automatic speech recognition).
      Below the keyframes are two arrows (one labeled with the vision-to-language model BLIP-2 and one labeled OCR), both pointing to Keyframe Text & Descriptions. To the right of the keyframes, keyframe text & descriptions, and transcript is an arrow (labeled LLM, GPT-4) pointing right to two descriptions: an on-screen text description (“Let's make the salad Jennifer Aniston ate … 3 cups cooked quinoa 1 cup cucumber 1/3 cup red onion 1/2 cup roasted, salted pistachios 1/2 cup mint … Salt & pepper”) and a shot-by-shot description (“Shot 1: A 3-second shot showing a lady holding a large salad bowl, mentioning she is making the salad Jennifer Aniston ate… Shot 10: The last shot ends with the woman eating the salad”).
      To the right of these two descriptions, an arrow labeled LLM (GPT-4) points to a long description (“The video showcases a woman preparing and tasting a quinoa salad, attributed to Jennifer Aniston's diet on the 'Friends' set. The salad is made from quinoa, cucumber, …and pepper.”). To the right of the descriptions, an arrow also labeled LLM (GPT-4) points right to a short description (“Video shows woman making Jennifer Aniston's rumored 'Friends' quinoa salad.”).

ShortScribe makes short-form videos accessible with hierarchical video descriptions. ShortScribe extracts video data by identifying keyframes and then applying automatic speech recognition (ASR), automated image description (BLIP-2), and optical character recognition (OCR). A large language model (GPT-4) then generates multiple descriptions.

Abstract

Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts (i.e., short-form videos) have become a primary source of information and entertainment. Many short-form videos are inaccessible to blind and low vision (BLV) viewers due to their rapid visual changes, on-screen text, and music or meme-audio overlays. In our formative study, 7 BLV viewers who regularly watched short-form videos reported frequently skipping such inaccessible content. We present ShortScribe, a system that provides hierarchical visual summaries of short-form videos at three levels of detail to support BLV viewers in selecting and understanding short-form videos. ShortScribe allows BLV users to navigate between video descriptions based on their level of interest. To evaluate ShortScribe, we assessed description accuracy and conducted a user study with 10 BLV participants comparing ShortScribe to a baseline interface. When using ShortScribe, participants reported higher comprehension and provided more accurate summaries of video content.

System

The figure shows two screens of a mobile interface, one on the left and one on the right. Left: A still frame of a woman holding a bowl of salad. Within the video, text reads “Let's make the salad Jennifer Aniston ate every day on the set of Friends.” Elements on top of the video are grouped into four groups: A, B, C, and D. In group A, in the bottom left, video text information is displayed, including a short description (“Video shows woman making Jennifer Aniston's rumored 'Friends' quinoa salad.”), username (“@nourished.by.mads”), video caption (“This salad is actually AMAZING - no joke. Ingredients and measuring portions are all listed in the video #salad #saladrecipe #jenniferanistonsalad #friendstvshow #jeniferaniston”), and audio title (“Just a Cloud Away - @Pharell Williams”). The screen also includes button icons aligned vertically along the right of the screen. Group B is the top three buttons: previous (upward arrow), play/pause (play icon), and next (downward arrow). Group C is the video description button (information icon). Group D includes like (heart), comment (speech bubble), bookmark (save icon), and share (share icon). Right: A popup presents three descriptions and a close button in the top right corner. The text includes a long description (“The video showcases a woman preparing and tasting a quinoa salad, attributed to Jennifer Aniston's diet on the 'Friends' set. The salad is made from quinoa, cucumber, red onion, pistachios, mint, parsley, feta, chickpeas, lemon juice, olive oil, salt, and pepper.”), on-screen text (“Let's make the salad Jennifer Aniston ate every day on the set of Friends 3 cups cooked quinoa 1 cup cucumber 1/3 cup red onion 1/2 cup roasted, salted pistachios … 1/4 cup olive oil Salt & pepper”), and shot-by-shot descriptions (“Shot 1: A 3-second shot showing a lady holding a large salad bowl, mentioning she is making the salad Jennifer Aniston ate every day on the set of Friends. Shot 2: In a 3-second span, we see a visual guide to preparing 3 cups of cooked quinoa. …”).

The ShortScribe interface consists of (a) front-screen video information including the short description, username, caption, and audio title, (b) video controls, (c) a button to open the description pane, which includes the long description, on-screen text, and shot-by-shot descriptions, and (d) video statistics.
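
To make the grouping concrete, the sketch below shows the kind of per-video record such an interface could consume; it is written in Python for illustration, and the class and field names (VideoInfo, shot_by_shot, etc.) are assumptions rather than the authors' actual schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoInfo:
    """Illustrative data behind the ShortScribe interface (names are assumptions)."""
    # (a) Front-screen video information.
    short_description: str   # e.g., "Video shows woman making ... quinoa salad."
    username: str
    caption: str
    audio_title: str
    # (c) Description pane, opened from the information button.
    long_description: str
    on_screen_text: str                                      # OCR'd text from the video
    shot_by_shot: List[str] = field(default_factory=list)    # one description per shot
    # (d) Statistics shown with the like, comment, and bookmark buttons.
    likes: int = 0
    comments: int = 0
    bookmarks: int = 0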


Pipeline

The figure shows a diagram of how the system works. From left to right, the diagram shows four groupings, or columns, all connected by arrows pointing right. The first group includes Video and Keyframes, with an arrow pointing downward from Video to Keyframes. From Video, there is an arrow labeled ASR pointing right to Transcript, which is in the second group. From Keyframes, there is an arrow labeled VLM pointing right to Image Captions and another arrow labeled OCR pointing right to On-Screen Text. Transcript, Image Captions, and On-Screen Text make up the second group and are labeled as shot-by-shot details. From the second group, there are two arrows, both labeled LLM, pointing right to Long Description and Shot-by-Shot Descriptions. From this third group, there is one arrow labeled LLM pointing right to Short Description.

ShortScribe takes a video as input, transcribes the audio using automatic speech recognition (ASR), segments the video into shots, and selects the middle frame of each shot as a keyframe. It then processes the transcript, generated image captions (BLIP-2), and on-screen text (OCR) to produce video data for each keyframe. We use a large language model (GPT-4) to summarize this data into a short, long, and shot-by-shot description.
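
As a rough illustration of that data flow, here is a minimal Python sketch. The helper functions (run_asr, detect_shots, extract_frame, caption_image, run_ocr, summarize_with_llm) are hypothetical placeholders for the ASR, shot-segmentation, BLIP-2, OCR, and GPT-4 components; this is a sketch of the described pipeline, not the authors' implementation.

def describe_video(video_path: str) -> dict:
    # 1. Transcribe the audio track with ASR.
    transcript = run_asr(video_path)

    # 2. Segment the video into shots and take the middle frame of each shot as a keyframe.
    shots = detect_shots(video_path)  # hypothetical: returns a list of (start_s, end_s) pairs
    keyframes = [extract_frame(video_path, (start + end) / 2) for start, end in shots]

    # 3. For each keyframe, collect a visual caption (BLIP-2) and on-screen text (OCR).
    per_shot_data = [{"caption": caption_image(f), "ocr": run_ocr(f)} for f in keyframes]

    # 4. Prompt the LLM (GPT-4 in the paper) for shot-by-shot and long descriptions,
    #    then condense them into a single-sentence short description.
    shot_by_shot = summarize_with_llm("shot_by_shot", transcript, per_shot_data)
    long_desc = summarize_with_llm("long", transcript, per_shot_data)
    short_desc = summarize_with_llm("short", long_desc)

    return {
        "short": short_desc,
        "long": long_desc,
        "shot_by_shot": shot_by_shot,
        "on_screen_text": " ".join(d["ocr"] for d in per_shot_data),
    }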


Pipeline Evaluation

A horizontal bar graph of hallucinations made by the pipeline. The x-axis is the percentage of videos. The y-axis on the left is description type, which lists (from top to bottom) long, short, 50-word, and per shot. The y-axis on the right is how many videos were analyzed for each description type, which lists (from top to bottom) 58, 58, 58, 18. Short descriptions had the highest percentage of videos with no hallucinations, with 40 out of 58 having no hallucination. Per-shot descriptions had the lowest percentage of videos with no hallucinations, with 7 out of 18 having no hallucination.

We analyzed hallucinations in descriptions for 58 videos (long, short, 50-word descriptions) and for a subsample of 18 videos (per shot descriptions). Descriptions for each video contained 0-7 hallucinations. Short descriptions had the lowest percentage of videos with hallucinations, while shot-by-shot descriptions had the highest percentage of videos with hallucinations.


An analysis of the errors in one of the two (of 58) videos that had more than three errors in the short description. The video depicts a lighthearted singalong. BLIP-2 mistakenly recognizes a toddler concentrating on singing as angry, and the on-screen text shows a quiz with the lyrics to a sad song (“All Too Well” by Taylor Swift). The long description, and then the short description, incorrectly infer that the video is sad.


User Evaluation

The figure shows two bar graphs.
          Left: A vertical bar graph titled “Video Understanding Summary Scores.” The x-axis is video title, listing (from left to right) V1, V4, V6, V2, V7, V5, V3, V8. The y-axis is the average score participants got on their summaries of the videos, listing (from top to bottom) 100%, 80%, 60%, 40%, 20%, and 0%. Each video has two bars, the one on the left being blue (representing participants who used our system) and the one on the right being red (representing participants who used the baseline). V1 has the largest difference: the system scored 92% and the baseline scored 6%. V4 had the smallest difference, with the system and baseline both scoring 75%.
          Right: A vertical bar graph titled “Video Understanding Ratings.” The x-axis is video title, listing (from left to right) V5, V1, V4, V3, V8, V7, V6, V2. The y-axis is the average rating participants gave their understanding of each video, on a scale from 1 (did not understand) to 7 (completely understood), listing (from top to bottom) 7, 5, 3, 1. Each video has two bars, the one on the left being blue (representing participants who used our system) and the one on the right being red (representing participants who used the baseline). V5 has the largest difference: the system scored 6.3 and the baseline scored 1.5. V4 had the smallest difference, with the system scoring 6 and the baseline 6.5.

Video comprehension for videos V1-V8 using our system (left, blue) and a baseline interface (right, orange), measured by scoring participant-written video summaries (Video Summary Scores) and by participants' ratings of their video understanding (Video Understanding Ratings). Ratings of video understanding ranged from 1 (did not understand) to 7 (completely understood). Error bars depict the 95% confidence interval.
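
As a reference for the error bars, a 95% confidence interval for a per-video mean can be computed as in the short Python sketch below; it assumes a t-distribution over per-participant scores, which may differ from the exact procedure used in the paper.

import math
from statistics import mean, stdev
from scipy import stats

def ci95(scores):
    """Mean and half-width of a 95% t-confidence interval (illustrative, not the paper's code)."""
    n = len(scores)
    m = mean(scores)
    sem = stdev(scores) / math.sqrt(n)      # standard error of the mean (sample stdev)
    t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
    return m, t_crit * sem

# Example: five hypothetical participants rate their understanding of one video on the 1-7 scale.
m, hw = ci95([6, 7, 5, 6, 7])
print(f"mean = {m:.1f}, 95% CI = [{m - hw:.1f}, {m + hw:.1f}]")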

A horizontal bar graph titled Feature Usefulness Ratings. The x-axis is the number of participants (1 to 10) and the y-axis lists features (from top to bottom): long description, short description, per-shot description, OCR description, original caption, engagement numbers, audio source, and username. The horizontal bars are filled with colors corresponding to how participants rated the usefulness of each feature, red being definitely not useful and blue being definitely useful. The first four features received overall high ratings, while the last four received much lower ratings.

Participants rated the usefulness of each feature for understanding the video. The description features (first four features) are provided by ShortScribe only, and the remaining features (last four features) were originally available on the short-form video platform. Engagement numbers refers to the number of likes, comments, and bookmarks.

Presentation Video

BibTeX

@article{van2024making,
  title={Making Short-Form Videos Accessible with Hierarchical Video Summaries},
  author={Van Daele, Tess and Iyer, Akhil and Zhang, Yuning and Derry, Jalyn C and Huh, Mina and Pavel, Amy},
  journal={arXiv preprint arXiv:2402.10382},
  year={2024}
}