Vision language models can now generate long-form answers to questions about images, which we call long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences in LFVQA and demonstrate that long-form answers contain information beyond the direct answer, such as explanations and suggestions. We further evaluate 360 long-form answers with BLV and sighted people to understand their preferences and to assess the correctness of the answers. Long-form answers generated by models often hallucinate incorrect information in response to unanswerable questions (e.g., questions about blurry images), yet BLV people often perceive these answers as correct. To reduce such hallucinations, we evaluate VQA models on their ability to correctly abstain from answering unanswerable questions.
Coming Soon...
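Until the dataset and evaluation code are released, the sketch below illustrates, in Python, one way a VizWiz-LF record with sentence-level functional-role annotations might be represented, along with toy helpers for summarizing role frequencies and measuring abstention on unanswerable questions. This is a minimal sketch under assumptions: all class, field, and role names here are hypothetical, not the released schema.

```python
# Hypothetical sketch of a VizWiz-LF-style record; field and role names are assumptions.
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedSentence:
    text: str
    role: str  # e.g., "answer", "explanation", "suggestion" (hypothetical label set)


@dataclass
class LongFormAnswer:
    image_id: str
    question: str
    source: str          # "human" or one of the six VQA models
    answerable: bool     # False for unanswerable questions (e.g., blurry images)
    sentences: List[AnnotatedSentence]


def role_distribution(answers: List[LongFormAnswer]) -> Counter:
    """Count functional roles across all annotated sentences."""
    return Counter(s.role for a in answers for s in a.sentences)


def abstention_rate(answers: List[LongFormAnswer]) -> float:
    """Fraction of unanswerable questions where the response abstains,
    approximated here as: no sentence is labeled with the "answer" role."""
    unanswerable = [a for a in answers if not a.answerable]
    if not unanswerable:
        return 0.0
    abstained = sum(
        1 for a in unanswerable if all(s.role != "answer" for s in a.sentences)
    )
    return abstained / len(unanswerable)


if __name__ == "__main__":
    example = LongFormAnswer(
        image_id="VizWiz_0001",
        question="What is in this can?",
        source="model",
        answerable=False,
        sentences=[
            AnnotatedSentence("The image is too blurry to read the label.", "explanation"),
            AnnotatedSentence("Try retaking the photo closer to the label.", "suggestion"),
        ],
    )
    print(role_distribution([example]))   # Counter({'explanation': 1, 'suggestion': 1})
    print(abstention_rate([example]))     # 1.0 (the response abstains from answering)
```

The abstention helper above is only an illustration of the kind of measurement described in the abstract; the actual evaluation protocol and metrics will be defined by the released code.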