Making Short-Form Videos Accessible
with Hierarchical Video Summaries

UT Austin, Cornell University
ACM CHI 2024
The figure shows information from left to right. On the far left is the original video, represented as a single frame of a woman holding a bowl of salad with the text “Let's make the salad Jennifer Aniston ate every day on the set of Friends.” Two arrows point to the right from this picture. One points to the keyframes, a stack of single frames showing ingredients being added to a bowl. The other arrow points to the transcript of the video (transcribed with automatic speech recognition).
      Below the keyframes are two arrows (one labeled with the vision-to-language model BLIP-2 and one labeled OCR), both pointing to Keyframe Text & Descriptions. To the right of the keyframes, keyframe text & descriptions, and transcript is an arrow (labeled LLM, GPT-4) pointing right to two descriptions: an on-screen text description (“Let's make the salad Jennifer Aniston ate … 3 cups cooked quinoa 1 cup cucumber 1/3 cup red onion 1/2 cup roasted, salted pistachios 1/2 cup mint … Salt & pepper”) and a shot-by-shot description (“Shot 1: A 3-second shot showing a lady holding a large salad bowl, mentioning she is making the salad Jennifer Aniston ate… Shot 10: The last shot ends with the woman eating the salad”).
      To the right of these two descriptions, an arrow labeled LLM (GPT-4) points to a long description (“The video showcases a woman preparing and tasting a quinoa salad, attributed to Jennifer Aniston's diet on the 'Friends' set. The salad is made from quinoa, cucumber, …and pepper.”). To the right of the descriptions, an arrow also labeled LLM (GPT-4) points right to a short description (“Video shows woman making Jennifer Aniston's rumored 'Friends' quinoa salad.”).

ShortScribe makes short-form videos accessible with hierarchical video descriptions. ShortScribe extracts video data by identifying keyframes and then applying automatic speech recognition (ASR), automated image description (BLIP-2), and optical character recognition (OCR). A large language model (GPT-4) then generates multiple descriptions.

Abstract

Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts (i.e., short-form videos) have become a primary source of information and entertainment. Many short-form videos are inaccessible to blind and low vision (BLV) viewers due to their rapid visual changes, on-screen text, and music or meme-audio overlays. In our formative study, 7 BLV viewers who regularly watched short-form videos reported frequently skipping such inaccessible content. We present ShortScribe, a system that provides hierarchical visual summaries of short-form videos at three levels of detail to support BLV viewers in selecting and understanding short-form videos. ShortScribe allows BLV users to navigate between video descriptions based on their level of interest. To evaluate ShortScribe, we assessed description accuracy and conducted a user study with 10 BLV participants comparing ShortScribe to a baseline interface. When using ShortScribe, participants reported higher comprehension and provided more accurate summaries of video content.

System

The figure shows two screens of a mobile interface, one on the left and one on the right. Left: A still frame of a woman holding a bowl of salad. Within the video, text reads “Let's make the salad Jennifer Aniston ate every day on the set of Friends.” Elements on top of the video are grouped into four groups: A, B, C, and D. In group A, in the bottom left, video text information is displayed, including a short description (“Video shows woman making Jennifer Aniston's rumored 'Friends' quinoa salad.”), username (“@nourished.by.mads”), video caption (“This salad is actually AMAZING - no joke. Ingredients and measuring portions are all listed in the video #salad #saladrecipe #jenniferanistonsalad #friendstvshow #jeniferaniston”), and audio title (“Just a Cloud Away - @Pharell Williams”). The screen also includes button icons aligned vertically along the right of the screen. Group B is the top three buttons: previous (upward arrow), play/pause (play icon), and next (downward arrow). Group C is the video description button (information icon). Group D includes like (heart), comment (speech bubble), bookmark (save icon), and share (share icon). Right: A popup presents three descriptions and a close button in the top right corner. The text includes a long description (“The video showcases a woman preparing and tasting a quinoa salad, attributed to Jennifer Aniston's diet on the 'Friends' set. The salad is made from quinoa, cucumber, red onion, pistachios, mint, parsley, feta, chickpeas, lemon juice, olive oil, salt, and pepper.”), on-screen text (“Let's make the salad Jennifer Aniston ate every day on the set of Friends 3 cups cooked quinoa 1 cup cucumber 1/3 cup red onion 1/2 cup roasted, salted pistachios … 1/4 cup olive oil Salt & pepper”), and shot-by-shot descriptions (“Shot 1: A 3-second shot showing a lady holding a large salad bowl, mentioning she is making the salad Jennifer Aniston ate every day on the set of Friends. Shot 2: In a 3-second span, we see a visual guide to preparing 3 cups of cooked quinoa. …”).

The ShortScribe interface consists of (a) front-screen video information including the short description, username, caption, and audio title, (b) video controls, (c) a button to open the description pane, which includes the long description, on-screen text, and shot-by-shot descriptions, and (d) video statistics.
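
To make the grouping concrete, the sketch below shows the kind of per-video record such an interface could consume; it is written in Python for illustration, and the class and field names (VideoInfo, shot_by_shot, etc.) are assumptions rather than the authors' actual schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoInfo:
    """Illustrative data behind the ShortScribe interface (names are assumptions)."""
    # (a) Front-screen video information.
    short_description: str   # e.g., "Video shows woman making ... quinoa salad."
    username: str
    caption: str
    audio_title: str
    # (c) Description pane, opened from the information button.
    long_description: str
    on_screen_text: str                                      # OCR'd text from the video
    shot_by_shot: List[str] = field(default_factory=list)    # one description per shot
    # (d) Statistics shown with the like, comment, and bookmark buttons.
    likes: int = 0
    comments: int = 0
    bookmarks: int = 0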


Pipeline

The figure shows a diagram of how the system works. From left to right, the diagram shows four groupings, or columns, all connected by arrows pointing right. The first group includes Video and Keyframes, with an arrow pointing downward from Video to Keyframes. From Video, there is an arrow labeled ASR pointing right to Transcript, which is in the second group. From Keyframes, there is an arrow labeled VLM pointing right to Image Captions and another arrow labeled OCR pointing right to On-Screen Text. Transcript, Image Captions, and On-Screen Text make up the second group and are labeled as shot-by-shot details. From the second group, there are two arrows, both labeled LLM, pointing right to Long Description and Shot-by-Shot Descriptions. From this third group, there is one arrow labeled LLM pointing right to Short Description.

ShortScribe takes a video as input, transcribes the audio using automatic speech recognition (ASR), segments the video into shots, and selects the middle frame of each shot as a keyframe. It then processes the transcript, generated image captions (BLIP-2), and on-screen text (OCR) to produce video data for each keyframe. We use a large language model (GPT-4) to summarize this data into a short, long, and shot-by-shot description.
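
As a rough illustration of that data flow, here is a minimal Python sketch. The helper functions (run_asr, detect_shots, extract_frame, caption_image, run_ocr, summarize_with_llm) are hypothetical placeholders for the ASR, shot-segmentation, BLIP-2, OCR, and GPT-4 components; this is a sketch of the described pipeline, not the authors' implementation.

def describe_video(video_path: str) -> dict:
    # 1. Transcribe the audio track with ASR.
    transcript = run_asr(video_path)

    # 2. Segment the video into shots and take the middle frame of each shot as a keyframe.
    shots = detect_shots(video_path)  # hypothetical: returns a list of (start_s, end_s) pairs
    keyframes = [extract_frame(video_path, (start + end) / 2) for start, end in shots]

    # 3. For each keyframe, collect a visual caption (BLIP-2) and on-screen text (OCR).
    per_shot_data = [{"caption": caption_image(f), "ocr": run_ocr(f)} for f in keyframes]

    # 4. Prompt the LLM (GPT-4 in the paper) for shot-by-shot and long descriptions,
    #    then condense them into a single-sentence short description.
    shot_by_shot = summarize_with_llm("shot_by_shot", transcript, per_shot_data)
    long_desc = summarize_with_llm("long", transcript, per_shot_data)
    short_desc = summarize_with_llm("short", long_desc)

    return {
        "short": short_desc,
        "long": long_desc,
        "shot_by_shot": shot_by_shot,
        "on_screen_text": " ".join(d["ocr"] for d in per_shot_data),
    }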


Pipeline Evaluation

A horizontal bar graph of hallucinations made by the pipeline. The x-axis is the percentage of videos. The y-axis on the left is description type, which lists (from top to bottom) long, short, 50-word, and per shot. The y-axis on the right is how many videos were analyzed for each description type, which lists (from top to bottom) 58, 58, 58, 18. Short descriptions had the highest percentage of videos with no hallucinations, with 40 out of 58 having no hallucination. Per-shot descriptions had the lowest percentage of videos with no hallucinations, with 7 out of 18 having no hallucination.

We analyzed hallucinations in descriptions for 58 videos (long, short, 50-word descriptions) and for a subsample of 18 videos (per shot descriptions). Descriptions for each video contained 0-7 hallucinations. Short descriptions had the lowest percentage of videos with hallucinations, while shot-by-shot descriptions had the highest percentage of videos with hallucinations.


An analysis of the errors in one of the two (of 58) videos that had more than three errors in the short description. The video depicts a lighthearted singalong. BLIP-2 mistakenly recognizes a toddler concentrating on singing as angry, and the on-screen text shows a quiz with the lyrics to a sad song (“All Too Well” by Taylor Swift). The long description, and then the short description, incorrectly infer that the video is sad.


User Evaluation

The figure shows two bar graphs.
          Left: A vertical bar graph titled “Video Understanding Summary Scores.” The x-axis is video title, listing (from left to right) V1, V4, V6, V2, V7, V5, V3, V8. The y-axis is the average score participants got on their summaries of the videos, listing (from top to bottom) 100%, 80%, 60%, 40%, 20%, and 0%. Each video has two bars, the one on the left being blue (representing participants who used our system) and the one on the right being red (representing participants who used the baseline). V1 has the largest difference: the system scored 92% and the baseline scored 6%. V4 had the smallest difference, with the system and baseline both scoring 75%.
          Right: A vertical bar graph titled “Video Understanding Ratings.” The x-axis is video title, listing (from left to right) V5, V1, V4, V3, V8, V7, V6, V2. The y-axis is the average rating participants gave their understanding of each video, on a scale from 1 (did not understand) to 7 (completely understood), listing (from top to bottom) 7, 5, 3, 1. Each video has two bars, the one on the left being blue (representing participants who used our system) and the one on the right being red (representing participants who used the baseline). V5 has the largest difference: the system scored 6.3 and the baseline scored 1.5. V4 had the smallest difference, with the system scoring 6 and the baseline 6.5.

Video comprehension for videos V1-V8 using our system (left, blue) and a baseline interface (right, orange), measured by scoring participant-written video summaries (Video Summary Scores) and by participants' ratings of their video understanding (Video Understanding Ratings). Ratings of video understanding ranged from 1 (did not understand) to 7 (completely understood). Error bars depict the 95% confidence interval.
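
As a reference for the error bars, a 95% confidence interval for a per-video mean can be computed as in the short Python sketch below; it assumes a t-distribution over per-participant scores, which may differ from the exact procedure used in the paper.

import math
from statistics import mean, stdev
from scipy import stats

def ci95(scores):
    """Mean and half-width of a 95% t-confidence interval (illustrative, not the paper's code)."""
    n = len(scores)
    m = mean(scores)
    sem = stdev(scores) / math.sqrt(n)      # standard error of the mean (sample stdev)
    t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
    return m, t_crit * sem

# Example: five hypothetical participants rate their understanding of one video on the 1-7 scale.
m, hw = ci95([6, 7, 5, 6, 7])
print(f"mean = {m:.1f}, 95% CI = [{m - hw:.1f}, {m + hw:.1f}]")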

A horizontal bar graph titled Feature Usefulness Ratings. The x-axis is the number of participants (1 to 10) and the y-axis lists features (from top to bottom): long description, short description, per-shot description, OCR description, original caption, engagement numbers, audio source, and username. The horizontal bars are filled with colors corresponding to how participants rated the usefulness of each feature, red being definitely not useful and blue being definitely useful. The first four features received overall high ratings, while the last four received much lower ratings.

Participants rated the usefulness of each feature for understanding the video. The description features (first four features) are provided by ShortScribe only, and the remaining features (last four features) were originally available on the short-form video platform. Engagement numbers refers to the number of likes, comments, and bookmarks.

Presentation Video

BibTeX

@article{van2024making,
  title={Making Short-Form Videos Accessible with Hierarchical Video Summaries},
  author={Van Daele, Tess and Iyer, Akhil and Zhang, Yuning and Derry, Jalyn C and Huh, Mina and Pavel, Amy},
  journal={arXiv preprint arXiv:2402.10382},
  year={2024}
}