Deep dive into the text-to-image pipeline
In the previous section, we produced all the examples by passing the prompt and various arguments directly to the pipeline. Under the hood, the pipeline consists of several components that act in sequence to produce images from your prompt. These components are stored in a Python dictionary on the pipeline class, so, like any Python dictionary, you can print its keys to inspect the components (Figure 15.14).
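For example, with the Hugging Face diffusers library this takes only a couple of lines. A minimal sketch follows; the checkpoint ID here is one common public model (an assumption, not necessarily the one used earlier), and the exact key list depends on the pipeline you load:

```python
from diffusers import StableDiffusionPipeline

# Load the pipeline (this checkpoint ID is an assumption; substitute
# whichever Stable Diffusion checkpoint you used in the earlier examples)
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# The pipeline exposes its parts as a plain Python dictionary
print(pipe.components.keys())
# Typical output for a v1.x checkpoint:
# dict_keys(['vae', 'text_encoder', 'tokenizer', 'unet', 'scheduler',
#            'safety_checker', 'feature_extractor'])
```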

Figure 15.14: Components of the Stable Diffusion pipeline
We’ve already seen each of these at work in the prior examples; their roles will become clearer as we walk through the execution of each one:
- The tokenizer takes our prompt and turns it into a sequence of token IDs
- The text encoder takes those token IDs and turns them into a sequence of numerical embedding vectors
- The U-Net takes a tensor of random noise together with the encoded prompt and predicts the noise present in the latents, conditioned on the prompt
- The scheduler runs the diffusion steps, using the U-Net’s noise predictions to progressively denoise the latents (a sketch of this whole sequence follows the list)
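To make the sequence concrete, here is a minimal, hedged sketch of the same flow driving the components directly via diffusers. It assumes the pipeline loaded above, omits classifier-free guidance and the negative prompt for brevity (so output quality will suffer), and hard-codes a 64×64 latent grid, which corresponds to a 512×512 image for v1.x models. Treat it as an illustration of the mechanics, not a drop-in replacement for calling the pipeline:

```python
import torch

prompt = "a photograph of an astronaut riding a horse"  # placeholder prompt

# 1. Tokenizer: text -> a sequence of token IDs
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)

# 2. Text encoder: token IDs -> one embedding vector per token
with torch.no_grad():
    text_embeddings = pipe.text_encoder(tokens.input_ids)[0]

# 3. Start from pure random noise in latent space
# (1 image, U-Net input channels, 64x64 latent grid -> 512x512 pixels)
latents = torch.randn(1, pipe.unet.config.in_channels, 64, 64)

# 4. Scheduler: set up the diffusion timesteps and scale the initial noise
pipe.scheduler.set_timesteps(50)
latents = latents * pipe.scheduler.init_noise_sigma

# Denoising loop: at each step the U-Net predicts the noise in the latents,
# conditioned on the prompt embedding, and the scheduler uses that
# prediction to compute the slightly less noisy latents for the next step
for t in pipe.scheduler.timesteps:
    latent_input = pipe.scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = pipe.unet(
            latent_input, t, encoder_hidden_states=text_embeddings
        ).sample
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

# Finally, the VAE decodes the denoised latents into pixel space
with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```

The decoded tensor holds pixel values in the range [-1, 1]; the pipeline normally rescales it and converts it to a PIL image before returning it to you.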