AVscript: Accessible Video Editing with Audio-Visual Scripts

UT Austin, Naver AI, KAIST, CMU, UCLA
ACM CHI 2023

AVscript is an accessible text-based video editing tool that enables blind and low-vision creators to edit videos efficiently using a screen reader. The audio-visual script provides a narration transcript segmented by scenes, scene descriptions, and highlighted visual errors (e.g., blur, bad lighting).

Abstract

Sighted and blind and low vision (BLV) creators alike use videos to communicate with broad audiences. Yet, video editing remains inaccessible to BLV creators. Our formative study revealed that current video editing tools make it difficult to access the visual content, assess the visual quality, and efficiently navigate the timeline. We present AVscript, an accessible text-based video editor. AVscript enables users to edit their video using a script that embeds the video's visual content, visual errors (e.g., dark or blurred footage), and speech. Users can also efficiently navigate between scenes and visual errors or locate objects in the frame or spoken words of interest. A comparison study (N=12) showed that AVscript significantly lowered BLV creators' mental demands while increasing confidence and independence in video editing. We further demonstrate the potential of AVscript through an exploratory study (N=3) where BLV creators edited their own footage.

System

AVscript’s video pane provides two types of audio notifications: scene change notifications and visual error notifications. When paused, users can activate inspect mode to access detected objects in the current frame.
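
As a rough illustration of how such notifications could be triggered during playback, the sketch below returns the labels of all events that the playhead crossed since the last update. The event list, the label strings, and the function name `notifications_between` are assumptions made for this example and are not taken from the AVscript implementation.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical event format: (time in seconds, label), sorted by time.
Event = Tuple[float, str]


def notifications_between(events: List[Event], prev_time: float, cur_time: float) -> List[str]:
    """Return labels of events with prev_time < time <= cur_time."""
    times = [t for t, _ in events]
    lo = bisect_right(times, prev_time)  # first event strictly after prev_time
    hi = bisect_right(times, cur_time)   # one past the last event at or before cur_time
    return [label for _, label in events[lo:hi]]


events = [
    (0.0, "scene change"),
    (4.2, "visual error: blur"),
    (9.0, "scene change"),
]
crossed = notifications_between(events, 4.0, 4.5)
```

A player would call this on every time update and play one audio cue per returned label.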

AVscript’s outline pane displays a navigable summary of the audio-visual script including the high-level scenes and potential edit points, such as pauses and visual errors.
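
One kind of edit point mentioned above, pauses, can be found from word-level transcript timings. The following is a minimal sketch under assumptions: the `(text, start, end)` word format and the 2-second `min_gap` threshold are illustrative choices, not values reported for AVscript.

```python
from typing import List, Tuple

# Hypothetical word format: (text, start time in seconds, end time in seconds).
Word = Tuple[str, float, float]


def find_pauses(words: List[Word], min_gap: float = 2.0) -> List[Tuple[float, float]]:
    """Return (start, end) spans where the silence between two words is at least min_gap."""
    pauses = []
    for (_, _, prev_end), (_, next_start, _) in zip(words, words[1:]):
        if next_start - prev_end >= min_gap:
            pauses.append((prev_end, next_start))
    return pauses


words = [("hi", 0.0, 0.4), ("there", 0.5, 0.9), ("so", 3.5, 3.7), ("anyway", 3.8, 4.3)]
pauses = find_pauses(words)
```

Each returned span could then be listed in the outline next to the scene it falls in.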

AVscript supports search over the transcribed speech and visual objects in the video. BLV creators can skim the results and click on a search result to jump to the corresponding point in the video.
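
A search of this kind can be sketched as a lookup over two timestamped lists, one for spoken words and one for detected objects. The data layout, the `Hit` type, and the case-insensitive substring matching below are assumptions for illustration; the actual AVscript matching strategy may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    time: float  # position in the video, in seconds
    kind: str    # "speech" or "object"
    text: str    # the matched word or object label


def search(query: str,
           words: List[Tuple[float, str]],
           objects: List[Tuple[float, str]]) -> List[Hit]:
    """Case-insensitive substring search over spoken words and object labels."""
    q = query.strip().lower()
    if not q:
        return []
    hits = [Hit(t, "speech", w) for t, w in words if q in w.lower()]
    hits += [Hit(t, "object", o) for t, o in objects if q in o.lower()]
    return sorted(hits, key=lambda h: h.time)  # chronological order for skimming


words = [(0.5, "This"), (1.0, "is"), (1.2, "my"), (1.5, "cat")]
objects = [(0.0, "sofa"), (1.0, "cat"), (4.0, "Cat toy")]
results = search("cat", words, objects)
```

Selecting a result would seek the player to `hit.time`. Substring matching also returns longer labels that contain the query, which may or may not be desirable.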


Pipeline

To segment the footage into scenes, we detect objects in each frame, using the nouns extracted from the transcript as a custom vocabulary. We then place a scene boundary wherever the objects detected in nearby frames change saliently. For each scene, we caption the first non-blurry frame and use that caption as the scene’s title in the audio-visual script.
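
The segmentation step can be illustrated with a small sketch. It assumes that each frame has already been reduced to a set of detected object labels, and it uses Jaccard distance with a threshold of 0.5 as a stand-in for "salient change". Both the distance measure and the threshold are assumptions for this example, not the criteria used by AVscript. The helper `first_sharp_frame` likewise assumes a precomputed per-frame blur flag.

```python
from typing import List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def segment_scenes(frame_objects: List[Set[str]], threshold: float = 0.5) -> List[int]:
    """Return the indices of frames that start a new scene."""
    if not frame_objects:
        return []
    boundaries = [0]  # the first frame always opens a scene
    for i in range(1, len(frame_objects)):
        if jaccard_distance(frame_objects[i - 1], frame_objects[i]) > threshold:
            boundaries.append(i)
    return boundaries


def first_sharp_frame(start: int, end: int, blurry: List[bool]) -> int:
    """Index of the first non-blurry frame in [start, end); falls back to start."""
    for i in range(start, end):
        if not blurry[i]:
            return i
    return start


frames = [
    {"dog", "ball"},
    {"dog", "ball"},
    {"dog"},
    {"kitchen", "pan"},
    {"kitchen", "pan", "stove"},
]
blurry = [False, False, False, True, False]
starts = segment_scenes(frames)
title_frame = first_sharp_frame(starts[1], len(frames), blurry)
```

In this toy input the second scene starts at frame 3, but that frame is flagged as blurry, so frame 4 would be the one sent to the captioning model.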


Evaluation

After the video editing tasks, we measured cognitive load using NASA-TLX. Participants rated AVscript significantly better than their own video editing tools on mental demand, temporal demand, effort, and frustration. AVscript was also rated significantly better for confidence in the output, independence in reviewing the output, and helpfulness in identifying errors.

BibTeX

@inproceedings{10.1145/3544548.3581494,
        author = {Huh, Mina and Yang, Saelyne and Peng, Yi-Hao and Chen, Xiang 'Anthony' and Kim, Young-Ho and Pavel, Amy},
        title = {AVscript: Accessible Video Editing with Audio-Visual Scripts},
        year = {2023},
        isbn = {9781450394215},
        publisher = {Association for Computing Machinery},
        address = {New York, NY, USA},
        url = {https://doi.org/10.1145/3544548.3581494},
        doi = {10.1145/3544548.3581494},
        abstract = {Sighted and blind and low vision (BLV) creators alike use videos to communicate with broad audiences. Yet, video editing remains inaccessible to BLV creators. Our formative study revealed that current video editing tools make it difficult to access the visual content, assess the visual quality, and efficiently navigate the timeline. We present AVscript, an accessible text-based video editor. AVscript enables users to edit their video using a script that embeds the video’s visual content, visual errors (e.g., dark or blurred footage), and speech. Users can also efficiently navigate between scenes and visual errors or locate objects in the frame or spoken words of interest. A comparison study (N=12) showed that AVscript significantly lowered BLV creators’ mental demands while increasing confidence and independence in video editing. We further demonstrate the potential of AVscript through an exploratory study (N=3) where BLV creators edited their own footage.},
        booktitle = {Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems},
        articleno = {796},
        numpages = {17},
        keywords = {video, accessibility, authoring tools},
        location = {Hamburg, Germany},
        series = {CHI '23}
        }