AVscript: Accessible Video Editing with Audio-Visual Scripts

UT Austin, Naver AI, KAIST, CMU, UCLA
ACM CHI 2023
Teaser image of how AVscript's components work. A video pane of AVscript is shown at the top left corner. There is a timeline bar indicating the progress of the video player, showing that 5 minutes and 30 seconds have passed in an 11-minute-long video. In the current frame, a pantry filled with food can be seen. Below the video pane is an outline pane (bottom left), which summarizes the video by listing its scenes and errors. At the top right of the figure is a tool pane, which allows users to trim or change the playback speed of the selected video clip. To the right of the tool pane is a search pane with a text field, where users can enter a search query for visual objects or narration. Finally, below the tool pane and the search pane is the audio-visual script, which shows a narration transcript segmented by scenes, scene descriptions, and highlighted visual errors such as blur and bad lighting.

AVscript is an accessible text-based video editing tool that enables blind and low-vision creators to edit videos efficiently using a screen reader. The audio-visual script provides a narration transcript segmented by scenes, scene descriptions, and highlighted visual errors (e.g., blur, bad lighting).

Abstract

Sighted and blind and low vision (BLV) creators alike use videos to communicate with broad audiences. Yet, video editing remains inaccessible to BLV creators. Our formative study revealed that current video editing tools make it difficult to access the visual content, assess the visual quality, and efficiently navigate the timeline. We present AVscript, an accessible text-based video editor. AVscript enables users to edit their video using a script that embeds the video's visual content, visual errors (e.g., dark or blurred footage), and speech. Users can also efficiently navigate between scenes and visual errors or locate objects in the frame or spoken words of interest. A comparison study (N=12) showed that AVscript significantly lowered BLV creators' mental demands while increasing confidence and independence in video editing. We further demonstrate the potential of AVscript through an exploratory study (N=3) where BLV creators edited their own footage.

System

The close-up of the video pane. Over the video player timeline, notification icons are shown to indicate that this information is provided as users play the video. When the user presses the 'i' key, the following speech is provided: "Inspect mode, currently in the frame: cereal box, snacks, shelf."

AVscript’s video pane provides two types of audio notifications: scene change notifications and visual error notifications. When paused, users can activate inspect mode to access detected objects in the current frame.
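As an illustration only (a minimal sketch, not AVscript's actual implementation), the inspect-mode announcement could be assembled from per-frame object detections roughly as follows; the data model and function name here are assumptions:

```python
# Sketch: building the inspect-mode announcement for the currently paused frame.
# frame_objects maps a frame timestamp (seconds) to the object labels detected in it.
from typing import Dict, List, Set

def inspect_announcement(current_time: float,
                         frame_objects: Dict[float, Set[str]]) -> str:
    """Return the text a screen reader should speak when the user presses 'i'.

    We announce the objects detected in the frame nearest to the paused position.
    """
    nearest = min(frame_objects, key=lambda t: abs(t - current_time))
    objects: List[str] = sorted(frame_objects[nearest])
    return "Inspect mode, currently in the frame: " + ", ".join(objects)

# Example (hypothetical detections):
# inspect_announcement(330.0, {330.0: {"cereal box", "snacks", "shelf"}})
# -> "Inspect mode, currently in the frame: cereal box, shelf, snacks"
```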






The outline pane and the audio-visual script are presented in the figure. In both panes, information about scenes 5 to 8 is shown. In the outline pane on the left, scene descriptions such as "The person presses on a time in the oven" and visual errors such as "Camera blur in 4:32" are listed. In the audio-visual script on the right, all of this information is embedded within the narration script.

AVscript’s outline pane displays a navigable summary of the audio-visual script including the high-level scenes and potential edit points, such as pauses and visual errors.






The tool pane and the search pane are presented in the figure. The user is searching for the keyword "microwave" and seven search results are listed, where one result comes from the narration search and the other six come from the visual search.

AVscript supports search over the transcribed speech and visual objects in the video. BLV creators can skim the results and click on a search result to jump to the corresponding point in the video.
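To illustrate the idea (the data model below is an assumption, not AVscript's actual implementation), a unified search might index both the transcript and per-frame object labels by timestamp, labeling each hit with its source so narration matches and visual matches can be skimmed together:

```python
# Sketch: combined narration + visual-object search keyed by video time (assumed data model).
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class SearchHit:
    time_sec: float   # where to jump in the video
    source: str       # "narration" or "visual"
    context: str      # matched transcript word or detected object label

def search(query: str,
           transcript: List[Dict],               # [{"word": str, "time": float}, ...]
           frame_objects: Dict[float, Set[str]]  # {frame_time: {object labels}}
           ) -> List[SearchHit]:
    q = query.lower()
    hits = [SearchHit(w["time"], "narration", w["word"])
            for w in transcript if q in w["word"].lower()]
    hits += [SearchHit(t, "visual", obj)
             for t, objs in frame_objects.items()
             for obj in objs if q in obj.lower()]
    return sorted(hits, key=lambda h: h.time_sec)

# Example: searching "microwave" returns both spoken mentions and frames in which
# a microwave was detected, ordered by time so each hit maps to a playback position.
```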


Pipeline

The figure shows how audio and visual frames of the video are processed to segment scenes, generate scene descriptions, and detect pauses and visual errors. On the right, the resulting audio-visual script is shown.

To segment the footage into multiple scenes, we detect objects in each frame, using the nouns extracted from the transcript as a custom vocabulary. Then, we segment the footage wherever there is a salient change in the objects detected in nearby frames. For each scene, we caption the first non-blurry frame and use the caption as the scene's title in the audio-visual script.
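A minimal sketch of this step, assuming per-frame object labels, a blur check, and a captioning model are already available (the similarity measure, threshold, and helper names below are illustrative placeholders, not the exact components used in the paper):

```python
# Sketch: segment footage by salient changes in detected objects, then title each scene.
from typing import Callable, List, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity between the object sets of two nearby frames."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def segment_scenes(frame_objects: List[Set[str]], threshold: float = 0.3) -> List[int]:
    """Return frame indices where a new scene starts.

    frame_objects[i] holds the object labels detected in frame i (e.g., restricted
    to nouns extracted from the transcript). A boundary is placed when the object
    sets of adjacent frames differ enough (similarity below `threshold`).
    """
    if not frame_objects:
        return []
    boundaries = [0]
    for i in range(1, len(frame_objects)):
        if jaccard(frame_objects[i - 1], frame_objects[i]) < threshold:
            boundaries.append(i)
    return boundaries

def scene_titles(boundaries: List[int], frames: list,
                 is_blurry: Callable, caption: Callable) -> List[str]:
    """Caption the first non-blurry frame of each scene and use it as the scene title."""
    titles = []
    for start, end in zip(boundaries, boundaries[1:] + [len(frames)]):
        frame = next((f for f in frames[start:end] if not is_blurry(f)), frames[start])
        titles.append(caption(frame))
    return titles
```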


Evaluation

The stacked bar charts display the distribution of the rating scores for the participants' personal editing tools and AVscript. Blue colors indicate lower (positive) responses and red colors indicate higher (negative) responses. AVscript significantly outperformed users' own tools in mental demand, temporal demand, effort, frustration, confidence in the output, independence in reviewing output, and helpfulness in identifying errors.

After the video editing tasks, we measured cognitive load using NASA-TLX. AVscript significantly outperformed users' own video editing tools in mental demand, temporal demand, effort, and frustration. AVscript was also rated significantly better for confidence in the output, independence in reviewing the output, and helpfulness in identifying errors.
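For readers unfamiliar with how such paired rating comparisons are typically analyzed, the sketch below uses a Wilcoxon signed-rank test on paired ordinal ratings; the choice of test and the ratings shown are illustrative assumptions, not the paper's reported procedure or data:

```python
# Sketch: comparing paired Likert-style ratings (own tool vs. AVscript) for one measure.
# The ratings below are made-up placeholders, not study data.
from scipy.stats import wilcoxon

own_tool_mental_demand = [6, 7, 5, 6, 7, 6, 5, 7, 6, 6, 7, 5]   # hypothetical, N=12
avscript_mental_demand = [3, 4, 2, 3, 4, 3, 2, 4, 3, 3, 4, 2]   # hypothetical, N=12

# Wilcoxon signed-rank test: a nonparametric test suited to paired ordinal ratings.
stat, p = wilcoxon(own_tool_mental_demand, avscript_mental_demand)
print(f"W = {stat}, p = {p:.4f}")
```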

Video Presentation

BibTeX

@inproceedings{10.1145/3544548.3581494,
        author = {Huh, Mina and Yang, Saelyne and Peng, Yi-Hao and Chen, Xiang 'Anthony' and Kim, Young-Ho and Pavel, Amy},
        title = {AVscript: Accessible Video Editing with Audio-Visual Scripts},
        year = {2023},
        isbn = {9781450394215},
        publisher = {Association for Computing Machinery},
        address = {New York, NY, USA},
        url = {https://doi.org/10.1145/3544548.3581494},
        doi = {10.1145/3544548.3581494},
        abstract = {Sighted and blind and low vision (BLV) creators alike use videos to communicate with broad audiences. Yet, video editing remains inaccessible to BLV creators. Our formative study revealed that current video editing tools make it difficult to access the visual content, assess the visual quality, and efficiently navigate the timeline. We present AVscript, an accessible text-based video editor. AVscript enables users to edit their video using a script that embeds the video’s visual content, visual errors (e.g., dark or blurred footage), and speech. Users can also efficiently navigate between scenes and visual errors or locate objects in the frame or spoken words of interest. A comparison study (N=12) showed that AVscript significantly lowered BLV creators’ mental demands while increasing confidence and independence in video editing. We further demonstrate the potential of AVscript through an exploratory study (N=3) where BLV creators edited their own footage.},
        booktitle = {Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems},
        articleno = {796},
        numpages = {17},
        keywords = {video, accessibility, authoring tools},
        location = {Hamburg, Germany},
        series = {CHI '23}
        }