GenAssist: Making Image Generation Accessible

UT Austin, CMU
ACM UIST 2023 (Conditionally Accepted)
Teaser image that illustrates how GenAssist generates the comparison description and per-image descriptions in the summary table. First, GenAssist takes as input the text prompt ``A young chef is cooking dinner for his parents'' and the four images generated from the prompt. Then, based on the prompt, GenAssist uses GPT-4 to ask prompt verification questions and uses BLIP-2 to answer them. GenAssist also asks questions based on the individual image captions using GPT-4 and answers them with BLIP-2. In addition to prompt verification questions and image-based questions, GenAssist asks questions related to visual content and style and answers them using BLIP-2, Detic, and CLIP. Finally, using all of the visual information, the comparison description (similarities and differences) and the per-image descriptions are generated using GPT-4.

GenAssist is a system that enables blind or low vision creators to generate images by providing rich visual descriptions of the generation results. Given a text prompt and a set of generated images, GenAssist uses a large language model (GPT-4) to generate prompt verification questions based on the text prompt, and image-based questions based on individual image captions (BLIP-2). GenAssist also extracts the visual content and style of the images using vision-language models (CLIP, BLIP-2) and an object detection model (Detic). All of the information is then summarized using GPT-4 to generate the comparison description and per-image descriptions.

Abstract

Blind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators as it is difficult to use authoring tools or assess image search results. Thus, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images based on a text description (i.e. prompt), it is difficult to assess the content and quality of generated images. We present GenAssist, a system to make text-to-image generation accessible. Using our interface, creators can verify whether generated image candidates followed the prompt, access additional details in the image not specified in the prompt, and skim a summary of similarities and differences between image candidates. To power the interface, GenAssist uses a large language model to generate visual questions, vision-language models to extract answers, and a large language model to summarize the results. Our study with 12 BLV creators demonstrated that GenAssist enables and simplifies the process of image selection and generation, making visual authoring more accessible to all.

Prompt Verification

A figure that illustrates an example of prompt verification questions. On the left is the input with the instruction ``Generate visual questions that verify whether each part of the prompt is correct. Number the questions.'' and the input prompt ``A young chef is cooking dinner for his parents.'' On the right are the output prompt verification questions: 1. Is there a chef in the image? 2. How old is the young chef? 3. Is the young chef cooking the food? 4. Are the parents present in the image?

To help users assess how well their generated images adhered to their prompt, GenAssist provides prompt verification. To perform prompt verification, we first use GPT-4 to generate visual questions that verify each part of the prompt.
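The exact implementation is not reproduced here; below is a minimal sketch of this step, assuming the OpenAI chat API, with the instruction text taken from the figure above. The helper function name and answer parsing are illustrative assumptions.

```python
# Sketch: generate prompt-verification questions with GPT-4.
# The parsing of the numbered list is an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_verification_questions(prompt: str) -> list[str]:
    instruction = (
        "Generate visual questions that verify whether each part of the "
        "prompt is correct. Number the questions."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": f"Prompt: {prompt}"},
        ],
    )
    text = response.choices[0].message.content
    # Keep only numbered lines, e.g. "1. Is there a chef in the image?"
    return [line.split(". ", 1)[1].strip()
            for line in text.splitlines()
            if ". " in line and line[:1].isdigit()]

questions = generate_verification_questions(
    "A young chef is cooking dinner for his parents."
)
```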



A figure that illustrates an example of prompt verification questions and their answers for each of the four images. The first question is ``Is there a chef in the image?'' and the answer is yes for all four images. The second question is ``How old is the young chef?'' and the first three images answer ``Young kid'' while the last image answers ``Young man''. The third question is ``Is the young chef cooking food?'' and the answers are yes for all four images. The final question is ``Are the parents present in the image?'' and the answer is yes for all images except the second.

We use BLIP-2 to generate answers to the prompt verification questions for each of the four generated images.
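A minimal sketch of this step, assuming the Hugging Face transformers BLIP-2 checkpoint `Salesforce/blip2-flan-t5-xl` and hypothetical image file paths; the question prompt format and generation settings are illustrative, not the authors' exact configuration.

```python
# Sketch: answer each verification question for each generated image
# with BLIP-2 visual question answering.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl"
).to(device)

def answer(image: Image.Image, question: str) -> str:
    inputs = processor(
        images=image, text=f"Question: {question} Answer:", return_tensors="pt"
    ).to(device)
    out = model.generate(**inputs, max_new_tokens=20)
    return processor.decode(out[0], skip_special_tokens=True).strip()

# Hypothetical paths for the four generated candidates.
images = [Image.open(f"candidate_{i}.png") for i in range(4)]
answers = {q: [answer(img, q) for img in images] for q in questions}
```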



This figure illustrates an example of the summary descriptions for the prompt verification questions. For the question ``How old is the young chef?'', the summary description is ``Three images depict a young kid, while Image 4 depicts a young man.'' For the second question ``Are the parents present in the image?'', the answer summary is ``Three images show parents present in the image, while Image 2 does not.''

To help users quickly find which images do or do not adhere to the prompt, we use GPT-4 to summarize the responses to each question.
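Continuing the sketch above, the per-image answers to each question could be summarized with a single GPT-4 call per question; the instruction wording below is an assumption, not the authors' exact prompt.

```python
# Sketch: summarize the per-image answers for one verification question.
def summarize_question(question: str, per_image_answers: list[str]) -> str:
    listing = "\n".join(
        f"Image {i + 1}: {a}" for i, a in enumerate(per_image_answers)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": (
                "Summarize how the four images answer the question, "
                "grouping images that share the same answer."
            )},
            {"role": "user", "content": f"Question: {question}\n{listing}"},
        ],
    )
    return response.choices[0].message.content

# e.g. "Three images depict a young kid, while Image 4 depicts a young man."
summaries = {q: summarize_question(q, a) for q, a in answers.items()}
```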




Visual Content & Style Extraction


This image illustrates an example of content and style questions answered by BLIP-2. The answers are provided for the same tutorial images. For the question ``What is the setting of the image?'', the answers are all kitchen. For the question ``What are the subjects of the image?'', the answers are father and children for the first image, chef, kitchen and vegetables for the second image, father, mother and son for the third and fourth image. For the question ``What is the emotion of the image?'', all answers are happy. For the question about the usage ``Where would this image be used?'', the answers are ``on a website'', ``in a cookbook'', ``a children's cooking class'', and ``on a website''. Finally, for the question ``What are the main colors?'', the first image answers ``Brown, blue, yellow'', the second image answers ``Black, white, red, green'', the third image answers ``blue and white'', and the final image answers ``Red, yellow, green''.

To enable access to image content and style details that were not specified in the prompt, we extract the visual content and visual style of the generated image candidates. We answer five fixed questions (setting, subjects, emotion, likely use, and main colors) using visual question answering with BLIP-2.
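Reusing the BLIP-2 helper from the earlier sketch, the five questions (phrased as in the figure above) can be asked of every image candidate; the loop itself is an illustrative assumption.

```python
# Sketch: fixed content/style questions answered with the BLIP-2 helper above.
content_style_questions = [
    "What is the setting of the image?",
    "What are the subjects of the image?",
    "What is the emotion of the image?",
    "Where would this image be used?",
    "What are the main colors of the image?",
]

content_style = {
    q: [answer(img, q) for img in images] for q in content_style_questions
}
```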



This figure shows an example of the object detection results using Detic per image. The first image depicts spoon, pot, cup, tub, apron, bowl, etc. The second image depicts spoon, sink, tomato, lettuce, hat, bowl, etc. The third image shows spoon, fork, knife, apple, sausage, plate, etc. The fourth image shows spoon, pot, window, flowerpot, plate, frog, etc.

To extract object information, we use Detic with an open detection vocabulary so that users can access all objects in each image.
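Detic is distributed as a Detectron2 project, so its setup is not reproduced here. As a rough stand-in, the sketch below uses the OWL-ViT open-vocabulary detector from transformers with a small hypothetical candidate vocabulary rather than Detic's full open vocabulary.

```python
# Sketch: open-vocabulary object detection. GenAssist uses Detic; this
# sketch substitutes OWL-ViT with a small, hypothetical vocabulary.
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection

det_processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

vocabulary = ["spoon", "pot", "cup", "apron", "bowl", "sink", "tomato",
              "lettuce", "hat", "fork", "knife", "plate", "window"]

def detect_objects(image, threshold=0.1) -> list[str]:
    inputs = det_processor(text=[vocabulary], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = det_processor.post_process_object_detection(
        outputs=outputs, target_sizes=target_sizes, threshold=threshold
    )[0]
    return sorted({vocabulary[i] for i in results["labels"].tolist()})

detected_objects = [detect_objects(img) for img in images]
```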



This figure shows the content- and style-related answers retrieved using the CLIP model. For the medium of the image, image 1 answers cartoon, storybook, and illustration; image 2 answers a stock photo; image 3 answers vector art; and image 4 answers cartoon, storybook, and illustration. For the lighting, all four images answer natural lighting. For the perspective, images 1, 3, and 4 answer medium shot while the second image answers centered shot. For errors in the image, only image 1 answers poorly drawn hands, while the other three do not.

To extract information about the medium, lighting, perspective, and errors, we answer questions for each image candidate by using CLIP to determine the similarity between the image and a limited set of answer choices.
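A minimal sketch of this step with the transformers CLIP model; the answer choices listed are illustrative examples of the limited sets, not the exact lists used by GenAssist.

```python
# Sketch: pick the closest answer choice via CLIP image-text similarity.
import torch
from transformers import CLIPProcessor, CLIPModel

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical answer-choice sets for each style question.
style_choices = {
    "medium": ["a stock photo", "a cartoon illustration", "vector art",
               "an oil painting"],
    "lighting": ["natural lighting", "studio lighting", "dim lighting"],
    "perspective": ["a close-up shot", "a medium shot", "a wide shot"],
}

def best_choice(image, choices: list[str]) -> str:
    inputs = clip_processor(text=choices, images=image,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip_model(**inputs).logits_per_image  # shape (1, n_choices)
    return choices[logits.softmax(dim=-1).argmax().item()]

styles = [{name: best_choice(img, choices)
           for name, choices in style_choices.items()}
          for img in images]
```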

Description Summarization



To enable users to quickly assess their image results, we summarize our pipeline results (prompt verification, prompt guideline, and caption-detail question-answer pairs for each image) using GPT-4 to create a description of image similarities and differences and a per-image description for each image.
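Continuing the earlier sketches, the final summarization could be a single GPT-4 call over the collected question-answer pairs; the system instruction and input format below are assumptions rather than the authors' exact prompt.

```python
# Sketch: final summarization with GPT-4, combining the extracted
# question-answer pairs for each image into one description.
def summarize_results(prompt: str, per_image_facts: list[str]) -> str:
    # per_image_facts[i] is a pre-formatted string of all question-answer
    # pairs extracted for image i (verification, content, style, objects).
    listing = "\n\n".join(
        f"Image {i + 1}:\n{facts}" for i, facts in enumerate(per_image_facts)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": (
                "You describe generated images for a blind creator. First "
                "summarize the similarities and differences across the "
                "images, then give a short description of each image."
            )},
            {"role": "user", "content": (
                f"Prompt: {prompt}\n\nExtracted visual information:\n{listing}"
            )},
        ],
    )
    return response.choices[0].message.content
```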

Examples

Supplementary Video

BibTeX

Coming soon...