Composing New Urban Futures with AI

Jamie Littlefield

How Text-to-Image Synthesizers Generate Content

In order to discuss critical approaches to AI, it may be helpful to review the processes text-to-image synthesizers use (see Ramesh et al., 2021; OpenAI, 2021; Reed et al., 2016). With an eye toward balancing concision with clarity (and an acknowledgment that many details are omitted here), this section presents a generalized snapshot of how OpenAI's DALL-E generates synthetic images. While the specifics differ between applications (Midjourney, for example, is generally understood to rely on a diffusion model, while the original DALL-E uses an autoregressive transformer), many features are similar. As a broad overview, DALL-E generates synthetic images through a five-stage process:

  1. DALL-E is trained on image/text pairs. The model learns from an extensive dataset containing pairs of text descriptions and corresponding images. This foundational training allows the model to learn the relationship between textual descriptions and their visual representations. Both the images and the text descriptions are broken into small chunks of meaning called "tokens" (the first sketch after this list shows a toy version of this tokenization).
  2. A user inputs a prompt, which is processed as contextualized tokens. After a user submits a textual prompt describing the desired image, DALL-E breaks the prompt into tokens and contextualizes them: it derives a dense representation for each token that captures not only its standalone meaning but also its significance in relation to the entire prompt. DALL-E interprets those tokens based on its past exposure to the tokenized pairs of images and text in the training data (this contextualization step also appears in the first sketch after this list).
  3. DALL-E sequentially generates a synthetic image. With the contextually enriched textual representation as a foundation, DALL-E begins the image generation process. This is not done in one attempt; rather, it is a sequential process, similar to how ChatGPT generates text. DALL-E predicts parts of the image, often patches, one after another, ensuring that the evolving visual content aligns with the intent of the input description. Each newly predicted patch is fed back in as input for the next prediction until the image is complete (see the second sketch after this list).
  4. DALL-E refines the synthetic image. After the initial synthesis, DALL-E uses techniques like temperature sampling to refine and diversify the generated image. By adjusting the "temperature," or randomness parameter, DALL-E can produce a spectrum of outputs for a single prompt. This variability allows the user to select the most contextually relevant and visually cohesive image from the generated set of options (see the third sketch after this list).
  5. The synthetic image (eventually) becomes part of a new training dataset. Although companies have been reluctant to provide details about what output becomes input for AI systems, it is widely expected that a large number of synthetic images will become part of future training data. This might happen through formal processes that directly capture and include AI output or through the publication of synthetic images on webpages that are later scraped to form new datasets.
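
For readers who want to see the moving parts behind steps 1 and 2, the short Python sketch below shows a toy version of tokenization and contextualization. It is a simplified illustration written for this chapter, not DALL-E's actual code: the whitespace tokenizer, the invented names (caption, vocab, d_model), and the random weights stand in for the learned subword tokenizer and deep transformer layers a real system uses.

```python
# Toy sketch of steps 1-2: split a caption into tokens, give each token a
# vector, then let one self-attention pass make those vectors context-aware.
import numpy as np

rng = np.random.default_rng(0)

caption = "a red bicycle leaning against a brick wall"
tokens = caption.split()                      # toy tokenizer: one token per word
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}
token_ids = [vocab[word] for word in tokens]  # the caption as a sequence of ids

d_model = 8                                   # tiny embedding size for the demo
embeddings = rng.normal(size=(len(vocab), d_model))
x = embeddings[token_ids]                     # one standalone vector per token

# One simplified self-attention pass: each token's vector is updated by
# attending to every other token, so "red" now carries information about "bicycle".
scores = x @ x.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
contextualized = weights @ x                  # context-aware token representations

print(token_ids)
print(contextualized.shape)                   # (number of tokens, d_model)
```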
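
The second sketch corresponds to step 3. It shows only the shape of the autoregressive loop: a stand-in function (predict_next_patch, invented here) returns random scores in place of a trained network, but the key pattern is visible, in that each newly chosen patch is appended to the sequence and conditions the next prediction.

```python
# Toy sketch of step 3: generate discrete image tokens (patches) one at a time.
import numpy as np

rng = np.random.default_rng(0)
NUM_IMAGE_TOKENS = 16    # a real model generates hundreds of image tokens
PATCH_VOCAB_SIZE = 512   # size of the discrete "visual vocabulary" (assumed)

def predict_next_patch(text_ids, image_ids):
    """Stand-in for the trained model: score every candidate next patch,
    conditioned on the prompt tokens and the patches generated so far."""
    return rng.normal(size=PATCH_VOCAB_SIZE)  # random logits for illustration

text_ids = [0, 6, 2, 4]            # a tokenized prompt (placeholder ids)
image_ids = []                     # the image starts empty
for _ in range(NUM_IMAGE_TOKENS):
    logits = predict_next_patch(text_ids, image_ids)
    next_patch = int(np.argmax(logits))   # greedy choice; see the next sketch
    image_ids.append(next_patch)          # output becomes input for the next step

print(image_ids)   # a sequence of patch ids that a decoder would turn into pixels
```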
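
Finally, a sketch of the temperature sampling mentioned in step 4. The numbers are toy values chosen for the example; the point is that dividing the scores by a temperature before converting them to probabilities flattens or sharpens the distribution, which is how one prompt can yield a varied set of candidate images.

```python
# Toy sketch of step 4: temperature controls how adventurous the sampling is.
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, temperature=1.0):
    """Sample one patch id; low temperatures are predictable, high ones diverse."""
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, 0.1])     # toy scores over four candidate patches
for temperature in (0.2, 1.0, 2.0):
    samples = [sample_with_temperature(logits, temperature) for _ in range(10)]
    print(temperature, samples)             # lower temperatures repeat the top patch
```

Running this loop shows the effect directly: at a temperature of 0.2 the sampler almost always returns the highest-scoring patch, while at 2.0 the choices spread across all four candidates.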

As an introduction to this process, students may find it helpful to think of text-to-image training datasets as old photo albums in which every picture is accompanied by a caption. As it studies this album, DALL-E learns to create its own pictures when given new captions. By breaking those captions down into smaller pieces (like individual words), DALL-E generates new images by remixing visual and textual meanings from past data.