Long-Form Answers to Visual Questions from
Blind and Low Vision People

UT Austin, CMU, HKUST, UC Boulder

COLM 2024 Oral Spotlight (Top 2%)

Our dataset VizWiz-LF contains 4.2k long-form answers to visual questions, collected from human expert describers and six VQA models. To understand the content and structure of long-form visual question answers (LFVQA), we create taxonomies of functional roles and information types and annotate each answer at the sentence level.

Abstract

Vision language models can now generate long-form answers to questions about images, known as long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences in LFVQA and demonstrate that long-form answers contain information beyond the answer itself, such as explanations and suggestions. We further evaluate 360 long-form answers with BLV and sighted people to understand their preferences and assess the correctness of the answers. Model-generated long-form answers to unanswerable questions (e.g., due to blurry images) often hallucinate incorrect information, yet BLV people frequently perceive these answers as correct. To reduce hallucinations, we evaluate VQA models on their ability to correctly abstain from answering unanswerable questions.

VizWiz-LF Dataset


To construct the VizWiz-LF dataset, we first sample image-question pairs from the original VizWiz dataset. For each visual question, we collect answers from human expert describers and six vision language models.
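To make the annotation scheme concrete, the sketch below shows one hypothetical VizWiz-LF record with sentence-level functional role and information type labels. The field names, label strings, and model name here are illustrative assumptions, not the released schema; only the overall structure (multiple answer sources per question, per-sentence annotations) follows the description above.

```python
# Hypothetical record structure for a VizWiz-LF long-form answer.
# Field names and label values are assumptions for illustration,
# not the dataset's actual schema.
from collections import Counter

record = {
    "image_id": "VizWiz_val_00000001",   # hypothetical identifier
    "question": "What is this product?",
    "answer_source": "GPT-4V",           # one of six VQA models or a human expert
    "sentences": [
        {"text": "This is a can of black beans.",
         "functional_role": "Answer"},
        {"text": "The label is partially blurry, so retaking the photo may help.",
         "functional_role": "Suggestion"},
    ],
}

# Tally how often each functional role appears in the answer.
role_counts = Counter(s["functional_role"] for s in record["sentences"])
print(role_counts["Answer"])       # 1
print(role_counts["Suggestion"])   # 1
```

A flat per-sentence layout like this makes it straightforward to aggregate role frequencies across answer sources, e.g., comparing how often human experts versus models include suggestions.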






Short vs Long Answers



While traditional visual question answering (VQA) datasets contain short-form answers collected from crowdworkers, long-form visual question answering (LFVQA) has great potential to provide detailed and nuanced answers, including explanations. When an answer cannot be determined from the provided image, long-form answers can still offer potential answers and suggestions to retake the photo.




Functional Roles



Definitions and examples of functional roles identified in our dataset




Information Types



Definitions and examples of information types identified in our dataset



BibTeX

@article{huh2024long,
  title={Long-Form Answers to Visual Questions from Blind and Low Vision People},
  author={Huh, Mina and Xu, Fangyuan and Peng, Yi-Hao and Chen, Chongyan and Murugu, Hansika and Gurari, Danna and Choi, Eunsol and Pavel, Amy},
  journal={arXiv preprint arXiv:2408.06303},
  year={2024}
}