Vision language models can now generate long-form answers to questions about images, which we call long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences in LFVQA and demonstrate that long-form answers contain information beyond the direct answer, such as explanations and suggestions. We further evaluate 360 long-form answers with BLV and sighted people to understand their preferences and to assess the correctness of the answers. Long-form answers generated by models often hallucinate incorrect information in response to unanswerable questions (e.g., questions about blurry images), yet BLV people often perceive these answers as correct. To reduce such hallucinations, we evaluate VQA models on their ability to correctly abstain from answering unanswerable questions.
Coming Soon...
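Until the dataset and evaluation code are released, the sketch below illustrates, in Python, one way a VizWiz-LF record with sentence-level functional-role annotations might be represented, along with toy helpers for summarizing role frequencies and measuring abstention on unanswerable questions. This is a minimal sketch under assumptions: all class, field, and role names here are hypothetical, not the released schema.

```python
# Hypothetical sketch of a VizWiz-LF-style record; field and role names are assumptions.
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedSentence:
    text: str
    role: str  # e.g., "answer", "explanation", "suggestion" (hypothetical label set)


@dataclass
class LongFormAnswer:
    image_id: str
    question: str
    source: str          # "human" or one of the six VQA models
    answerable: bool     # False for unanswerable questions (e.g., blurry images)
    sentences: List[AnnotatedSentence]


def role_distribution(answers: List[LongFormAnswer]) -> Counter:
    """Count functional roles across all annotated sentences."""
    return Counter(s.role for a in answers for s in a.sentences)


def abstention_rate(answers: List[LongFormAnswer]) -> float:
    """Fraction of unanswerable questions where the response abstains,
    approximated here as: no sentence is labeled with the "answer" role."""
    unanswerable = [a for a in answers if not a.answerable]
    if not unanswerable:
        return 0.0
    abstained = sum(
        1 for a in unanswerable if all(s.role != "answer" for s in a.sentences)
    )
    return abstained / len(unanswerable)


if __name__ == "__main__":
    example = LongFormAnswer(
        image_id="VizWiz_0001",
        question="What is in this can?",
        source="model",
        answerable=False,
        sentences=[
            AnnotatedSentence("The image is too blurry to read the label.", "explanation"),
            AnnotatedSentence("Try retaking the photo closer to the label.", "suggestion"),
        ],
    )
    print(role_distribution([example]))   # Counter({'explanation': 1, 'suggestion': 1})
    print(abstention_rate([example]))     # 1.0 (the response abstains from answering)
```

The abstention helper above is only an illustration of the kind of measurement described in the abstract; the actual evaluation protocol and metrics will be defined by the released code.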