Toward a Critical Multimodal Composition

Sierra S. Parker

Data Sets and Biases

AI's inherent human elements, embedded through programming and training sets, create the potential for outputs that reinforce hegemonic biases, damaging stereotypes, and structural discrimination. Although text-to-image generative AI seem capable of depicting anything described, biases can still be reproduced not only through the user's text prompts and agency but also through the models themselves: "AI models are politically and ideologically laden ways of classifying the rich social and cultural tapestry of the Internet—which itself is a pale reflection of human diversity" (Vartiainen and Tedre 16). AI can only create outputs based on the data it has access to and, because text-to-image generative AI are trained on resources like the ImageNet database and models like Contrastive Language-Image Pre-training (CLIP), they will reproduce biases already present in those data sets and sedimented across the web.

Many text-to-image generative AI are trained using ImageNet or CLIP, both of which rely on text-image pairs and aim to make object recognition possible in machine vision. ImageNet is a researcher-created image database containing over 14 million images from the internet that have been manually labeled and grouped into interconnected sets based on concepts, with concepts articulated through sets of cognitive synonyms (ImageNet). CLIP is OpenAI's machine learning model and provides the training behind OpenAI's Dall-E 2 generator (OpenAI). Rather than relying on manually labeled images as ImageNet does, CLIP draws on text that is already paired with images publicly available on the internet. This source of data comes with benefits: CLIP is less constrained by labor costs, AI trained on CLIP are adept with the everyday natural language used on the web, and the images the AI can produce are less limited and directed by researcher-created categories. Furthermore, the images that AI can produce with CLIP will develop and change along with the internet, enabling responses to new trends and styles. Despite these benefits, however, the lack of a finite data set means that the reasons the AI produces particular images in response to language prompts are less clear. With a finite data set, researchers can look to the data to understand why, for example, all the images produced for a particular textual descriptor contain feminine-presenting people. A finite data set can be interrogated, understood, and even updated to ameliorate biases. AI trained on CLIP, by contrast, produce visuals based on the public internet at large, and what happens behind the scenes is a black box, opaque to the user.
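
To make the text-image pairing concrete, the short sketch below shows how the openly released CLIP weights can be asked to score a set of candidate captions against a single image, which is the basic operation underlying the machine-vision recognition described above. It is a minimal illustration only, assuming the publicly available "openai/clip-vit-base-patch32" checkpoint accessed through the Hugging Face transformers library; the image file and caption list are placeholders rather than anything drawn from Dall-E 2 itself.

```python
# Minimal sketch: scoring candidate captions against an image with the
# openly released CLIP weights (via the Hugging Face transformers library).
# The image path and caption list below are placeholders for illustration.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image file
captions = ["a photo of a doctor", "a photo of a nurse", "a photo of a teacher"]

# CLIP embeds the text and the image into a shared space and returns a
# similarity score for every caption-image pair.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# A higher probability means the model judges that caption a better match,
# a judgment shaped entirely by the web-scraped pairs it was trained on.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{caption}: {prob:.2f}")
```

Which caption "wins" here depends on nothing the user can inspect directly; the score simply reflects whatever pairings of words and pictures the model absorbed from the public web.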

CLIP-trained AI provide no explanations for the images they produce, leaving interpretation of the visuals, prompts, and cultures involved up to the user. Without easy answers supplied for depicted biases, these AI are fruitful technologies for interrogating how biases are (re)produced and how they proliferate. In other media, like television or film, depicted biases might be framed as the fault of specific groups of people who impose their perspectives on the product. For example, a TV show cast entirely with white actors could be attributed to a casting director or a production company. With CLIP-trained AI, however, the products stem from composite biases formed from all the public texts of internet users; the AI scrape biases from culture at large. I have chosen to base this chapter's analysis on Dall-E 2 and Bing Image Creator (a separate AI owned by Microsoft and powered by OpenAI's technology) because their CLIP training makes them ripe for this critical interrogation. An additional rationale for this decision is that both Dall-E 2 and Bing Image Creator are free to use and function entirely online, requiring no additional programs to run, which makes the two platforms more practical for classroom use and more financially accessible for students.

Biases in AI are often caused by representational bias in the data set. Representational bias stems from incomplete or non-comprehensive data sets that do not accurately reflect the real world. Since access to the internet is itself an economic privilege not equitably available to everyone, an economic representational bias is inherent to AI. Populations with greater access to the internet will likely contribute a larger share of the text-image pairings that the AI has access to, and thus socioeconomic status influences whose voices, languages, and cultures orient the AI's training and output. Three readily identifiable types of representational bias stemming from the training set in text-to-image generative models are "misrepresentation (e.g. harmfully stereotyped minorities), underrepresentation (e.g. eliminating occurrence of one gender in certain occupations) and overrepresentation (e.g. defaulting to Anglocentric perspectives)" (Vartiainen and Tedre 15). Srinivasan and Uchino offer two additional examples of influential representational bias from the perspective of art history: (1) biases in representing art styles through generalization or superficial reflection, and (2) biased historical representations that do not accurately reflect the reality of an event or period. These various kinds of representational bias can have negative sociocultural effects, such as spreading misinformation, influencing how groups and cultures are referred to and remembered, and creating misunderstandings about historical moments in public memory.

Focusing on CLIP in particular, Dehouche finds that language and image identifiers are paired in ways influenced by cultural biases (Dehouche). For example, attractiveness is linked to femininity and richness is linked to masculinity. Dehouche compares these connections between terms to broader patterns in how gender is referred to in the English language, noting that English tends to associate adjectives expressing richness and poorness with male subjects more often and adjectives communicating attractiveness and unattractiveness with female subjects more often; AI using CLIP will, as a result, represent the same gendered stereotypes. In this way, AI are bound to reinforce biases and stereotypes found in the culture and language from which their image and caption pairings are taken.
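
The kind of association Dehouche describes can be glimpsed, in a simplified way, by comparing where gendered terms and descriptive adjectives sit in CLIP's text embedding space. The sketch below is not a reproduction of Dehouche's method; it is an assumed, illustrative probe using the same publicly released checkpoint as above, with placeholder word lists, meant only to show how such pairings can be measured.

```python
# Simplified probe of the kind of association Dehouche examines: compare how
# close gendered terms sit to "wealth" and "attractiveness" phrases in CLIP's
# text embedding space. Illustrative sketch only, not Dehouche's procedure;
# the prompt lists are placeholders.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

subjects = ["a man", "a woman"]
adjectives = ["a rich person", "an attractive person"]

def embed(texts):
    # Encode prompts and L2-normalize so dot products equal cosine similarity.
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)

subj_emb = embed(subjects)
adj_emb = embed(adjectives)

# Rows: subjects, columns: adjectives. Asymmetries in this matrix hint at the
# gendered pairings CLIP has absorbed from its web-scraped training data.
similarity = subj_emb @ adj_emb.T
for i, subject in enumerate(subjects):
    for j, adjective in enumerate(adjectives):
        print(f"sim({subject!r}, {adjective!r}) = {similarity[i, j].item():.3f}")
```

If, for instance, "an attractive person" sits measurably closer to "a woman" than to "a man" while "a rich person" shows the reverse pattern, the matrix would echo the gendered linkages Dehouche reports, surfacing in a few lines of arithmetic the cultural sediment this chapter asks students to interrogate.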