GenAssist: Making Image Generation Accessible

UT Austin, CMU
ACM UIST 2023 (Conditionally Accepted)
Teaser image that illustrates how GenAssist generates the comparison description and per-image descriptions in the summary table. First, GenAssist takes as input the text prompt ``A young chef is cooking dinner for his parents'' and the four images generated from the prompt. Then, based on the prompt, GenAssist uses GPT-4 to ask prompt verification questions and uses BLIP-2 to answer them. GenAssist also asks questions based on the individual image captions using GPT-4 and answers them with BLIP-2. In addition to prompt verification questions and image-based questions, GenAssist asks questions related to visual content and style and answers them using BLIP-2, Detic, and CLIP. Finally, using all of the visual information, the comparison description (similarities and differences) and the per-image descriptions are generated using GPT-4.

GenAssist is a system that enables blind or low vision creators to generate images by providing rich visual descriptions of the generation results. Given a text prompt and a set of generated images, GenAssist uses a large language model (GPT-4) to generate prompt verification questions based on the text prompt, and image-based questions based on individual image captions (BLIP-2). GenAssist also extracts the visual content and style of the images using vision-language models (CLIP, BLIP-2) and an object detection model (Detic). All of the information is then summarized using GPT-4 to generate the comparison description and per-image descriptions.

Abstract

Blind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators as it is difficult to use authoring tools or assess image search results. Thus, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images based on a text description (i.e. prompt), it is difficult to assess the content and quality of generated images. We present GenAssist, a system to make text-to-image generation accessible. Using our interface, creators can verify whether generated image candidates followed the prompt, access additional details in the image not specified in the prompt, and skim a summary of similarities and differences between image candidates. To power the interface, GenAssist uses a large language model to generate visual questions, vision-language models to extract answers, and a large language model to summarize the results. Our study with 12 BLV creators demonstrated that GenAssist enables and simplifies the process of image selection and generation, making visual authoring more accessible to all.

Prompt Verification

A figure that illustrates an example of prompt verification questions. On the left is the input with the instruction ``Generate visual questions that verify whether each part of the prompt is correct. Number the questions.'' and the input prompt ``A young chef is cooking dinner for his parents.'' On the right are the output prompt verification questions: 1. Is there a chef in the image? 2. How old is the young chef? 3. Is the young chef cooking the food? 4. Are the parents present in the image?

To help users assess how well their generated images adhered to their prompt, GenAssist provides prompt verification. To perform prompt verification, we first use GPT-4 to generate visual questions that verify each part of the prompt.
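The exact implementation is not reproduced here; below is a minimal sketch of this step, assuming the OpenAI chat API, with the instruction text taken from the figure above. The helper function name and answer parsing are illustrative assumptions.

```python
# Sketch: generate prompt-verification questions with GPT-4.
# The parsing of the numbered list is an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_verification_questions(prompt: str) -> list[str]:
    instruction = (
        "Generate visual questions that verify whether each part of the "
        "prompt is correct. Number the questions."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": f"Prompt: {prompt}"},
        ],
    )
    text = response.choices[0].message.content
    # Keep only numbered lines, e.g. "1. Is there a chef in the image?"
    return [line.split(". ", 1)[1].strip()
            for line in text.splitlines()
            if ". " in line and line[:1].isdigit()]

questions = generate_verification_questions(
    "A young chef is cooking dinner for his parents."
)
```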



A figure that illustrates an example of prompt verification questions and their answers for each of the four images. The first question is ``Is there a chef in the image?'' and the answer is yes for all four images. The second question is ``How old is the young chef?'' and the first three images answer ``Young kid'' while the last image answers ``Young man''. The third question is ``Is the young chef cooking food?'' and the answers are yes for all four images. The final question is ``Are the parents present in the image?'' and the answer is yes for all images except the second.

We use BLIP-2 to generate answers to the prompt verification questions for each of the four generated images.
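A minimal sketch of this step, assuming the Hugging Face transformers BLIP-2 checkpoint `Salesforce/blip2-flan-t5-xl` and hypothetical image file paths; the question prompt format and generation settings are illustrative, not the authors' exact configuration.

```python
# Sketch: answer each verification question for each generated image
# with BLIP-2 visual question answering.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl"
).to(device)

def answer(image: Image.Image, question: str) -> str:
    inputs = processor(
        images=image, text=f"Question: {question} Answer:", return_tensors="pt"
    ).to(device)
    out = model.generate(**inputs, max_new_tokens=20)
    return processor.decode(out[0], skip_special_tokens=True).strip()

# Hypothetical paths for the four generated candidates.
images = [Image.open(f"candidate_{i}.png") for i in range(4)]
answers = {q: [answer(img, q) for img in images] for q in questions}
```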



This figure illustrates an example of the summary descriptions for the prompt verification questions. For the question ``How old is the young chef?'', the summary description is ``Three images depict a young kid, while Image 4 depicts a young man.'' For the second question ``Are the parents present in the image?'', the answer summary is ``Three images show parents present in the image, while Image 2 does not.''

To help users quickly find which images do or do not adhere to the prompt, we use GPT-4 to summarize the responses to each question.
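Continuing the sketch above, the per-image answers to each question could be summarized with a single GPT-4 call per question; the instruction wording below is an assumption, not the authors' exact prompt.

```python
# Sketch: summarize the per-image answers for one verification question.
def summarize_question(question: str, per_image_answers: list[str]) -> str:
    listing = "\n".join(
        f"Image {i + 1}: {a}" for i, a in enumerate(per_image_answers)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": (
                "Summarize how the four images answer the question, "
                "grouping images that share the same answer."
            )},
            {"role": "user", "content": f"Question: {question}\n{listing}"},
        ],
    )
    return response.choices[0].message.content

# e.g. "Three images depict a young kid, while Image 4 depicts a young man."
summaries = {q: summarize_question(q, a) for q, a in answers.items()}
```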




Visual Content & Style Extraction


This image illustrates an example of content and style questions answered by BLIP-2. The answers are provided for the same tutorial images. For the question ``What is the setting of the image?'', the answers are all kitchen. For the question ``What are the subjects of the image?'', the answers are father and children for the first image, chef, kitchen and vegetables for the second image, father, mother and son for the third and fourth image. For the question ``What is the emotion of the image?'', all answers are happy. For the question about the usage ``Where would this image be used?'', the answers are ``on a website'', ``in a cookbook'', ``a children's cooking class'', and ``on a website''. Finally, for the question ``What are the main colors?'', the first image answers ``Brown, blue, yellow'', the second image answers ``Black, white, red, green'', the third image answers ``blue and white'', and the final image answers ``Red, yellow, green''.

To enable access to image content and style details that were not specified in the prompt, we extract the visual content and visual style of the generated image candidates. We answer five fixed questions (setting, subjects, emotion, likely use, and main colors) using visual question answering with BLIP-2.
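Reusing the BLIP-2 helper from the earlier sketch, the five questions (phrased as in the figure above) can be asked of every image candidate; the loop itself is an illustrative assumption.

```python
# Sketch: fixed content/style questions answered with the BLIP-2 helper above.
content_style_questions = [
    "What is the setting of the image?",
    "What are the subjects of the image?",
    "What is the emotion of the image?",
    "Where would this image be used?",
    "What are the main colors of the image?",
]

content_style = {
    q: [answer(img, q) for img in images] for q in content_style_questions
}
```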



This figure shows an example of the object detection results using Detic per image. The first image depicts spoon, pot, cup, tub, apron, bowl, etc. The second image depicts spoon, sink, tomato, lettuce, hat, bowl, etc. The third image shows spoon, fork, knife, apple, sausage, plate, etc. The fourth image shows spoon, pot, window, flowerpot, plate, frog, etc.

To extract object information, we use Detic with an open detection vocabulary so that users can access all objects in each image.
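Detic is distributed as a Detectron2 project, so its setup is not reproduced here. As a rough stand-in, the sketch below uses the OWL-ViT open-vocabulary detector from transformers with a small hypothetical candidate vocabulary rather than Detic's full open vocabulary.

```python
# Sketch: open-vocabulary object detection. GenAssist uses Detic; this
# sketch substitutes OWL-ViT with a small, hypothetical vocabulary.
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection

det_processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

vocabulary = ["spoon", "pot", "cup", "apron", "bowl", "sink", "tomato",
              "lettuce", "hat", "fork", "knife", "plate", "window"]

def detect_objects(image, threshold=0.1) -> list[str]:
    inputs = det_processor(text=[vocabulary], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = det_processor.post_process_object_detection(
        outputs=outputs, target_sizes=target_sizes, threshold=threshold
    )[0]
    return sorted({vocabulary[i] for i in results["labels"].tolist()})

detected_objects = [detect_objects(img) for img in images]
```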



This figure shows the content- and style-related answers retrieved using the CLIP model. For the medium of the image, image 1 answers cartoon, storybook, and illustration; image 2 answers a stock photo; image 3 answers vector art; and image 4 answers cartoon, storybook, and illustration. For the lighting, all four images answer natural lighting. For the perspective, images 1, 3, and 4 answer medium shot while the second image answers centered shot. For errors in the image, only image 1 answers poorly drawn hands, while the other three do not.

To extract information about the medium, lighting, perspective, and errors, we answer questions for each image candidate by using CLIP to determine the similarity between the image and a limited set of answer choices.
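A minimal sketch of this step with the transformers CLIP model; the answer choices listed are illustrative examples of the limited sets, not the exact lists used by GenAssist.

```python
# Sketch: pick the closest answer choice via CLIP image-text similarity.
import torch
from transformers import CLIPProcessor, CLIPModel

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical answer-choice sets for each style question.
style_choices = {
    "medium": ["a stock photo", "a cartoon illustration", "vector art",
               "an oil painting"],
    "lighting": ["natural lighting", "studio lighting", "dim lighting"],
    "perspective": ["a close-up shot", "a medium shot", "a wide shot"],
}

def best_choice(image, choices: list[str]) -> str:
    inputs = clip_processor(text=choices, images=image,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip_model(**inputs).logits_per_image  # shape (1, n_choices)
    return choices[logits.softmax(dim=-1).argmax().item()]

styles = [{name: best_choice(img, choices)
           for name, choices in style_choices.items()}
          for img in images]
```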

Description Summarization



To enable users to quickly assess their image results, we summarize our pipeline results (prompt verification, prompt guideline, and caption-detail question-answer pairs for each image) using GPT-4 to create a description of image similarities and differences and a per-image description for each image.
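Continuing the earlier sketches, the final summarization could be a single GPT-4 call over the collected question-answer pairs; the system instruction and input format below are assumptions rather than the authors' exact prompt.

```python
# Sketch: final summarization with GPT-4, combining the extracted
# question-answer pairs for each image into one description.
def summarize_results(prompt: str, per_image_facts: list[str]) -> str:
    # per_image_facts[i] is a pre-formatted string of all question-answer
    # pairs extracted for image i (verification, content, style, objects).
    listing = "\n\n".join(
        f"Image {i + 1}:\n{facts}" for i, facts in enumerate(per_image_facts)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": (
                "You describe generated images for a blind creator. First "
                "summarize the similarities and differences across the "
                "images, then give a short description of each image."
            )},
            {"role": "user", "content": (
                f"Prompt: {prompt}\n\nExtracted visual information:\n{listing}"
            )},
        ],
    )
    return response.choices[0].message.content
```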

Examples

Supplementary Video

BibTeX

Coming soon...