Recap: Transformer architectures
Transformers are the backbone of today's generation of models. In the previous chapters, we covered not only how the capabilities of NLP models have evolved over the years but also the internals of the transformer itself (see Chapter 3 and Chapter 4 for details). In this section, we will briefly recap the high-level aspects of the transformer setup and then build on that foundation in the remainder of the chapter. Figure 5.1 provides a high-level schematic that we will walk through step by step.

Figure 5.1: A recap of: A) the internals of a transformer architecture, B) the three main architectural variants of the transformer models, C) the two-step training paradigm showcasing pretraining followed by fine-tuning
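At the heart of the internals shown in panel (A) sits scaled dot-product attention. As a refresher, the sketch below implements that single operation in plain NumPy; the function name and the toy input sizes are illustrative choices, not part of any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, the core operation in a transformer layer.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    Returns an array of shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # weighted mix of values

# Toy self-attention: 3 tokens with 4-dimensional embeddings, Q = K = V = x
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4)
```

A full transformer layer wraps this operation with learned projections for Q, K, and V, multiple parallel heads, residual connections, layer normalization, and a feed-forward sublayer, as panel (A) of the figure summarizes.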
Transformers are complex models assembled, LEGO-like, from multiple specialized components. Figure 5.1 (A) shows the key components of this setup. Briefly, a vanilla transformer model consists...