The LLaMA models
The LLaMA family of models [2, 3] is a set of open-source LLMs developed by Meta; the latest general language model in this family is LLaMA 3. In introducing this model [3], the development team highlighted a few key architectural features:
- It is a variant of the GPT-3/PaLM models [4], which are built primarily from the transformer units we’ve seen in earlier chapters.
- It makes use of Root Mean Square Normalization (RMSNorm) layers on the inputs of each transformer sub-layer, which helps keep the magnitude of gradients manageable during training [5]; in earlier LLMs, this normalization has more commonly been applied to the outputs of the transformer modules (see the RMSNorm sketch after this list).
- It uses the SwiGLU activation function we saw in Chapter 2 in its feed-forward layers (also sketched after this list).
- It uses Rotary Positional Embeddings (RoPE), a flexible method of representing the relative position of input tokens (i.e., how far apart they are in the sequence) [6]; positions are encoded by rotating the embedded tokens, so that the inner product between them, which is efficient to compute in the transformer module, depends only on their relative position (a sketch follows this list).
- It is trained with the AdamW optimizer we saw in Chapter... (a one-line usage example appears at the end of the sketches below).
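
To make these components concrete, the sketches below show minimal PyTorch versions of each. First, RMSNorm: unlike LayerNorm, it does not subtract the mean or add a bias; it simply rescales by the root mean square of the features. The class name, epsilon value, and learned gain are illustrative assumptions, not Meta's exact implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal Root Mean Square normalization (illustrative sketch)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                               # guards against division by zero
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescale by the root mean square over the feature dimension;
        # no mean subtraction and no bias, in contrast to LayerNorm.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```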
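
Next, a feed-forward block built around the SwiGLU activation: the input is projected twice, and one projection is passed through the SiLU (swish) nonlinearity to gate the other elementwise. The layer names and the hidden dimension are assumptions for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block using the SwiGLU activation (illustrative sketch)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)    # value projection
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)  # back to model dim

    def forward(self, x):
        # SwiGLU: silu(W_gate x) gates (W_up x) elementwise.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```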
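
Here is one way to implement rotary positional embeddings, applied to the query and key vectors before attention. This sketch uses the "rotate-half" pairing of feature dimensions; implementations differ in how they pair dimensions, and the base of 10,000 is only the conventional choice, so treat the details as assumptions:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    dim must be even. Feature i is paired with feature i + dim // 2, and each
    pair is rotated by an angle proportional to the token's position, so the
    inner product of two rotated vectors depends only on their relative
    positions.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per feature pair, decaying geometrically.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    # angles[m, i] = m * freqs[i]: the rotation angle for pair i at position m.
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2-D rotation applied to each (x1, x2) pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In attention, apply_rope would be called on the query and key matrices (per head) before the attention scores are computed, which is why the relative-position information surfaces in their inner products.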
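
Finally, AdamW is available directly in PyTorch; the model and hyperparameter values below are placeholders for illustration, not the settings reported for LLaMA:

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for a full LLaMA-style model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,           # placeholder learning rate
    weight_decay=0.1,  # decoupled weight decay: AdamW's key difference from Adam
)
```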