How well can generative AI design and evaluate user interfaces

Authors
Zhenyuan Sun and Chris Baber
Abstract
The inexorable rise of generative artificial intelligence (GenAI) is threatening a range of work domains. In this paper we explore whether user interfaces produced by GenAI compare with those produced by humans, and whether GenAI can evaluate user interfaces to a human-like standard. We created user interface designs for a burger-ordering app using prompts to Midjourney on Discord, DALL-E 3 on ChatGPT4o, and Stable Diffusion 3 on Stable Assistant. All three GenAI apps had problems producing legible text and following the prompts provided. However, with adjusted prompting, DALL-E 3 and Stable Diffusion 3 produced viable designs that met the brief. We compared the resulting designs with commercial products and with designs created by eight competent (human) user interface designers, through a survey in which 32 participants evaluated the designs using the UEQ-S. We found no difference in pragmatic quality between designs, but the designs from GenAI were rated significantly higher on hedonic quality than those from the commercial products or human designers (with the commercial apps receiving the lowest ratings on all measures). We then prompted ChatGPT4o and Stable Assistant to evaluate the user interface designs using the UEQ-S. We found little correlation between the ratings of the GenAI apps and those of human raters. This suggests that GenAI might have a place (with appropriate prompt engineering) in generating user interface designs but that, at present, it struggles to produce reliable, human-like evaluations of them.
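
As a minimal illustrative sketch (not the authors' analysis code), the comparison between human and GenAI UEQ-S ratings can be framed as a rank correlation over per-design item means; the UEQ-S has eight items on a -3 to +3 scale, with items 1-4 forming the pragmatic quality scale and items 5-8 the hedonic quality scale. All numbers below are hypothetical placeholder data.

```python
# Sketch: correlating hypothetical human vs. GenAI UEQ-S ratings for several designs.
# Rows = designs, columns = the eight UEQ-S items (scores on a -3..+3 scale).
import numpy as np
from scipy.stats import spearmanr

human_means = np.array([
    [1.2, 0.8, 1.5, 1.1, 0.4, 0.9, 0.6, 0.7],
    [0.5, 0.3, 0.9, 0.6, 1.8, 1.5, 1.9, 1.6],
    [-0.2, 0.1, 0.4, 0.0, -0.5, -0.3, -0.1, -0.4],
])
genai_means = np.array([
    [2.1, 1.9, 2.3, 2.0, 1.7, 1.8, 2.0, 1.9],
    [1.5, 1.2, 1.8, 1.4, 2.2, 2.0, 2.4, 2.1],
    [0.8, 1.0, 1.1, 0.9, 0.5, 0.7, 0.9, 0.6],
])

# Rank correlation over all design-by-item cells (one of several reasonable choices).
rho, p = spearmanr(human_means.ravel(), genai_means.ravel())
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

# UEQ-S scale scores: mean of items 1-4 (pragmatic) and items 5-8 (hedonic).
pragmatic = human_means[:, :4].mean(axis=1)
hedonic = human_means[:, 4:].mean(axis=1)
print("Human pragmatic quality per design:", pragmatic.round(2))
print("Human hedonic quality per design:", hedonic.round(2))
```

A low correlation under such an analysis would indicate that the GenAI evaluator's ratings do not track human judgements, which is the pattern the paper reports.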