Precision-Guided Image Generation: Better text-to-image results with latent diffusion


Typical text-to-image generators can generate pictures of a cat, but not your cat. That’s because it’s hard to describe precisely, in a text prompt, everything that distinguishes your pet from other members of its species. A new approach guides diffusion models in a way that can produce pictures of your darling Simba.

What’s new: Rinon Gal and colleagues at Nvidia and Tel-Aviv University devised a method to make a diffusion-based, text-to-image generator produce pictures of a particular object or in a particular style.

Basics of diffusion models: During training, a text-to-image generator based on diffusion takes a noisy image and a text description. A transformer learns to embed the description, and a diffusion model learns to use the embeddings to remove the noise in successive steps. At inference, the system starts with pure noise and a text description, and it iteratively removes noise according to the text to generate an image. A variant known as a latent diffusion model saves computation by removing noise from a small, learned representation of the image rather than from the full-size image itself.
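
For readers who want to see the mechanics, here is a minimal sketch of that denoising loop in PyTorch. It is schematic, not the authors' code: the `denoiser` stands in for the model's noise predictor, and the linear noise schedule is an illustrative choice.

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, text_emb, steps=1000, shape=(1, 4, 64, 64)):
    """Minimal DDPM-style reverse process: start from pure noise and remove
    the noise predicted by the denoiser, conditioned on the text embedding."""
    betas = torch.linspace(1e-4, 0.02, steps)            # simple linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                               # pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), text_emb)   # predicted noise at step t
        # Standard DDPM estimate of the previous, less noisy step.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                        # add fresh noise except at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # in a latent diffusion model, a separate decoder maps x back to pixels
```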

Key insight: A text-to-image generator feeds word embeddings of the text prompt to an image generator. Adding a learned embedding that represents a set of related images can prompt the generator to produce the common attributes of those images in addition to the semantic content of the words.
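
To make the insight concrete, the sketch below wires a single trainable vector for a pseudo-word into an otherwise frozen embedding table. The class name, vocabulary size, and token ID are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class PromptEmbedder(nn.Module):
    """Frozen word-embedding table plus one trainable vector for a new
    pseudo-word such as S* (a sketch of the idea, not the authors' code)."""
    def __init__(self, vocab_size=50000, dim=768, pseudo_token_id=49999):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.embedding.weight.requires_grad_(False)       # pretrained weights stay frozen
        self.pseudo_token_id = pseudo_token_id
        # The only trainable parameter: the embedding of the pseudo-word.
        self.pseudo_word_vec = nn.Parameter(torch.randn(dim) * 0.01)

    def forward(self, token_ids):                         # token_ids: (batch, seq_len)
        embeds = self.embedding(token_ids)
        mask = (token_ids == self.pseudo_token_id).unsqueeze(-1)
        # Wherever the pseudo-token appears, substitute the learned vector.
        return torch.where(mask, self.pseudo_word_vec, embeds)
```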

How it works: The authors used a text-to-image generator based on a latent diffusion model. The system was pretrained on 400 million text-image pairs scraped from the web. Its weights were frozen.

  • The authors fed the system three to five images that shared an object (in different rotations or settings) or style (depicting different objects). They also gave it a text description of the images with a missing word denoted by the characters S∗. Descriptions included phrases like “a painting of S∗” or “a painting in the style of S∗”.
  • The system learned an embedding for S∗ that represented attributes the images had in common (see the training sketch after this list).
  • Given a prompt that included “S∗” — for instance, “a grainy photo of S∗ in Angry Birds” — the transformer embedded the words and S∗. The latent diffusion model took the embeddings and produced an image.
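
In code, learning the S∗ embedding amounts to minimizing the frozen diffusion model's usual noise-prediction loss while updating only that one vector. The sketch below reuses the hypothetical PromptEmbedder above; the denoiser, image encoder, and noise schedule are assumed interfaces rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def textual_inversion_step(denoiser, image_encoder, prompt_embedder,
                           images, token_ids, optimizer, alpha_bars):
    """One training step that optimizes only the S* embedding against the
    frozen model's noise-prediction objective (schematic)."""
    with torch.no_grad():
        latents = image_encoder(images)                   # frozen image encoder -> latents

    # Forward diffusion: add noise to the latents at a random timestep.
    t = torch.randint(0, alpha_bars.shape[0], (latents.shape[0],))
    noise = torch.randn_like(latents)
    a = alpha_bars[t].view(-1, 1, 1, 1)
    noisy_latents = torch.sqrt(a) * latents + torch.sqrt(1.0 - a) * noise

    text_emb = prompt_embedder(token_ids)                 # only the S* vector is trainable
    pred_noise = denoiser(noisy_latents, t, text_emb)     # frozen denoiser

    loss = F.mse_loss(pred_noise, noise)                  # gradient flows only into S*
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # e.g., Adam over [prompt_embedder.pseudo_word_vec]
    return loss.item()
```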

Results: The authors evaluated their model’s output by comparing CLIP embeddings of original and generated images. They measured similarity on a scale from 0 to 1, where 1 signifies two identical inputs. The model scored around 0.78. Images generated from human-crafted descriptions of up to 12 words, without reference to S∗, scored around 0.6. Images generated from longer descriptions of up to 30 words scored around 0.625.
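
As an illustration of that kind of CLIP-based scoring, the snippet below computes the cosine similarity between CLIP embeddings of two images using Hugging Face's transformers library. The checkpoint name is a standard public one chosen for illustration; the paper doesn't specify this exact code.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_similarity(path_a, path_b):
    """Cosine similarity between CLIP embeddings of two images."""
    images = [Image.open(path_a).convert("RGB"), Image.open(path_b).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = F.normalize(feats, dim=-1)
    return F.cosine_similarity(feats[0:1], feats[1:2]).item()
```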

Why it matters: The authors’ method offers a simple way for users of diffusion-based, text-to-image generators to steer the output toward specific attributes of content or style without retraining the model.

We’re thinking: Could this approach be extended to encompass multiple learned vectors and allow users to combine them as they like? That would make it possible to control image generation in even more precise ways.