Toward A Critical Multimodal Composition: Analyzing Bias in Text-to-Image Generative AI
Sierra S. Parker
Pennsylvania State University
Introduction
As artificial intelligence (AI) chatbots like ChatGPT have rapidly developed capabilities in recent years, and as academia's interest in AI models has skyrocketed in response, issues of ethics and plagiarism, information accuracy, and citation have come to the forefront. These, however, are only a few of the questions academics must pose, and remaining within their purview risks overlooking both the potential of AI as a cultural artifact for analysis and the multiple modes of AI that students will need to navigate ethically in their education, careers, and lives. This chapter proposes a critical analysis approach that brings AI's bias vividly to the fore, illustrating the method with text-to-image generative AI output. The approach can help composition teachers address concerns about integrity while still benefiting from artificial intelligence's ability to generate a range of artifacts for classroom analysis.
While a focus on academic integrity is essential for writing teachers and writing programs alike, this focus often privileges the textual, overlooking major conversations about the relationship between text and visuals in composition and excluding a popular form of AI: text-to-image generative models. These models have been met with concerns in art and technology disciplines similar to those that ChatGPT raises in writing studies, concerns evidenced by recent scholarship and conference presentations about AI's influence on copyright, ethics, and the creative process. Because these models can generate images in a range of styles and forms from descriptions written in everyday language, people at any level of design skill or artistic practice can easily create visuals with them. This functionality has led scholars to question whether AI art meets the same standards as non-AI art (Mazzone and Elgammal; Elgammal et al.; Hong and Curran); how AI counts as a creative process and whether attribution should be given to the AI (Oppenlaender; Mazzone and Elgammal; Coeckelbergh; Hertzmann; Zylinska); how AI will influence jobs in creative fields in terms of processes, roles, and availability (Vimpari et al.; Vartiainen and Tedre; Ko et al.); and how bias in data sets might affect users and cultures through the produced images (Vartiainen and Tedre; Srinivasan and Parikh; Srinivasan and Uchino; Dehouche; Zylinska).
As rhetoric and composition scholars have already established, the process of relating text and images is complex (Rawson; Jack). Through analyses of archival description and images, K. J. Rawson illustrates how metadata influences the ways images are organized, accessed, used, and interpreted. Through description and selection, the archivist thus plays a crucial and subjective role, building the infrastructure that influences the image's future. When AI text-to-image generators pull from image-text pairs (either from a specified data set or the internet), the text that already accompanies the images holds the same directing power as archival description. For example, if images of squirrels are labeled primarily as "rodents" rather than "squirrels," the images the AI creates for the descriptor "rodent" will be more likely to include squirrel characteristics. In the long term, this may change how viewers think of squirrels and/or rodents. Viewers may be more inclined to dislike squirrels for their rodent status, or they may be more accepting of various species of rodents because of their connection to cute squirrels. In other words, the text that people subjectively connect to images will influence the AI's output, potentially altering how people view the subjects based on the prompt-image relationship. The AI's infrastructure creates layers of interpretation: subjectivity is embedded through how the original creators labeled the images, how the AI extracts data from the image-text pairs, and how the user writes the text prompt for an output.
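To make that directing power concrete, consider the toy sketch below. It is my illustration rather than any generator's actual architecture: the image-caption pairs and the simple keyword matching are hypothetical stand-ins for the statistical associations a real model learns from millions of labeled images, but they show how the wording of a caption determines what a term like "rodent" comes to picture.

```python
# A toy sketch (not any model's real architecture) of how caption wording
# steers what a text-to-image system associates with a term. The captions
# and the keyword matching are hypothetical simplifications; real models
# learn statistical associations from millions of image-caption pairs.

# Hypothetical image-caption pairs, standing in for scraped training data.
dataset = [
    ("img_001.jpg", "a rodent perched on a park bench"),          # actually a squirrel
    ("img_002.jpg", "a rodent eating an acorn in a tree"),        # actually a squirrel
    ("img_003.jpg", "a rat, a common urban rodent, on a subway platform"),
    ("img_004.jpg", "a squirrel burying a nut in the yard"),
]

def images_for(term):
    """Return images whose captions mention the term, a stand-in for the
    associations a model would form between words and visual features."""
    return [path for path, caption in dataset if term in caption.lower()]

# Because two squirrel photos were captioned "rodent," a prompt containing
# "rodent" is dominated by squirrel imagery, while "squirrel" retrieves only
# the single image explicitly labeled that way.
print(images_for("rodent"))    # ['img_001.jpg', 'img_002.jpg', 'img_003.jpg']
print(images_for("squirrel"))  # ['img_004.jpg']
```

Scaled up to a web-sized dataset, the same dynamic means that whoever writes the captions quietly shapes what the model will show when a user types a given word.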
As the subjectivity of image and text relationships becomes more complex with these technologies, a critical step in students' ongoing multimodal literacy development will be to examine the processes and visual pedagogies of AI images. Rawson explains that rhetoric equips a person to interrogate the infrastructures that organize images in archives (330). In the same way, I argue that rhetoric equips people to interrogate the infrastructures of AI, especially given the opacity and shifting contexts of text-to-image generators that source imagery from an evolving internet landscape.
The approach to AI images proposed here presents a twenty-first-century pedagogy of sight. Working with seventeenth-century images based on the then-new microscope, Jordynn Jack offers the phrase "pedagogy of sight" as "a rhetorical framework that instructs readers how to view images in accordance with an ideological or epistemic program" (192). A pedagogy of sight teaches people how to view an image in a specific, ideologically invested way. Jack's term engages "the specific rhetorical strategies rhetors use to teach their readers how to see and interpret an image" (193, emphasis original). With AI, there is no single rhetor who teaches readers how to interpret the generated images because agency is distributed across multiple actors. However, pedagogy of sight as a critical frame for analysis can equip teachers and students to interrogate outputs rhetorically, understanding how audiences are affected by AI's ideological ladenness and, thus, how audiences will learn to interpret the outputs.
Because AI composites information sourced from available text-image pairs, it often reaffirms majority values and norms. The AI's pedagogy of sight supports this reaffirmation by operating implicitly in several ways. The natural language inputs persuade users that they are creating images solely from their own text prompts; on the surface, the output does not appear to be informed by cultural biases from across the internet or dataset. Additionally, it is easy to assume that the internet or dataset accurately represents human participation rather than being a social construction with hierarchies of accessibility and representation. Finally, the genres that the AI incorporates in its output are familiar, already invested with assumptions and ideologies. Headshots invoke professionalism; mugshots invoke legality and criminality. Though similar in how they frame the upper body, the genres differ in environment, the body's positioning and dress, and the subject's affect. In other words, when these genre characteristics appear, viewers are oriented in a visual register and led to interpret the images in directed ways.
Viewing AI-generated images as having a pedagogy of sight also promotes technological literacy development. A contemporary goal of teaching writing is supporting students' growth as critical rhetoricians invested in an increasingly digital world. To be a rhetorician is to understand what means of persuasion are available and to know when and how to use each one. In Multiliteracies for a Digital Age, Stuart Selber encourages composition teachers to foster "multiliteracies" with technology, going beyond functional application to engage critical and rhetorical literacies as well. Functional literacy allows the student to use a tool effectively, understand the possibilities of using the tool, and handle issues that arise when working with it. For text-to-image generative AI, the student would understand how to use description effectively to produce the images they desire. Critical literacy extends this tool-oriented knowledge, cultivating an understanding of technologies as ideologically laden artifacts shaped by institutionally informed designs and practices. Critical literacy allows students to question, contextualize, and make informed critiques about the technology. For AI, this means that students learn how the AI produces its output and the ways the data it draws on are already inflected by factors like the economics of participation on the internet, representational bias, stereotypes, and image-captioning practices. Recognizing the ideological inflections of AI is a first step toward interrogating the output's pedagogy and helps students address ethical questions such as when and how AI should be used. Finally, rhetorical literacy involves students engaging critically and reflectively with the technology as part of navigating the rhetorical situation. By teaching students to question the technology and to reflect critically on the visual pedagogies an output invokes, composition instructors support ethically responsible use of AI through literacy practices that transfer to other technologies and visual artifacts.
Using text-to-image generative AI in the classroom thus serves multiple learning goals. The AI can generate artifacts for analysis and contains processes worthy of analysis in themselves. The complex interplay between text and image, at the level of both the AI's sourcing and the student's prompting, can attune students to the framing relationship between language and visual interpretation. The implicitly functioning visual register of AI-generated output makes the artifacts' persuasive power less overt than text, requiring viewers to interrogate taken-for-granted visual grammars to interpret the biases and effects of the message. Additionally, interrogating AI text-to-image generation teaches a critical and reflective use of technologies as ideologically laden, thus fostering multiliteracies with writing technologies and developing students' multimodal composition practices.
In what follows, I explain how various kinds of text-to-image generative AI compose images, expanding on the ways bias can be sourced and proliferated through AI. I review scholarship to outline some general conversations about biases and data sets. Then, I analyze findings from my own experiments with two text-to-image generative models, DALL-E 2 and Bing Image Creator. Finally, I conclude by offering strategies for integrating text-to-image generative AI into the composition classroom to produce artifacts for critical analysis. I sketch ways to use AI to support learning goals like developing multimodal composition practices and fostering multiliteracies with technology.
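For readers who want to reproduce this kind of prompt-to-image experimentation programmatically rather than through the models' web interfaces, the minimal sketch below uses OpenAI's Python SDK. The model name, prompt, and parameters are illustrative assumptions on my part, not the settings behind the experiments reported in this chapter.

```python
# A minimal sketch (not the chapter's procedure) of generating images from a
# natural-language prompt with OpenAI's Python SDK; the same experiments can
# be run through the models' public web interfaces instead.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

# The prompt text is the only input the user controls; everything else the
# image inherits comes from the model's training data and defaults.
response = client.images.generate(
    model="dall-e-2",
    prompt="a professional headshot of a scientist",  # hypothetical prompt
    n=4,                # several outputs make patterns of bias easier to spot
    size="512x512",
)

for image in response.data:
    print(image.url)  # save or display the URLs for classroom analysis
```

Generating several outputs per prompt, as the n parameter does here, makes it easier to treat the images as a small corpus and to look for recurring patterns rather than judging any single image in isolation.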