Mixtral
Another family of popular open-source LLMs was developed by the French firm Mistral AI. Because it is released under the permissive Apache 2.0 license, it is a good tool for experimentation and even potential commercial use. We described earlier how the LLaMA family of LLMs uses the GPT-2-style transformer architecture. While Mistral’s latest model, Mixtral, also uses transformer blocks as a core module, it is based on the Mixture of Experts (MoE) architecture10. In an MoE model, the input (the user’s prompt text) is encoded into vectorized embeddings, as in LLaMA and other similar models. However, this architecture then introduces a router (Figure 6.3), which routes each input token to a subset of the experts (here, 2 of 8), where each expert is a set of transformer layers in the model.
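To make the routing concrete, below is a minimal PyTorch sketch of a top-2 MoE layer: a linear router scores each token embedding against 8 experts, only the two highest-scoring experts process the token, and their outputs are combined using the softmax of those two router scores. The class name, dimensions, and the simple feed-forward expert design are illustrative assumptions, not Mixtral’s actual implementation.

```python
# A minimal, hypothetical sketch of top-2 Mixture-of-Experts routing.
# Sizes (d_model=64, 8 experts) are toy values chosen for readability.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router: a single linear layer (weight matrix W_g) that scores
        # each token against every expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each "expert" here is a small feed-forward block (an assumption
        # for illustration; real experts are much larger).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model),
                           nn.SiLU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                       # x: (n_tokens, d_model)
        logits = self.router(x)                 # (n_tokens, n_experts)
        # Keep only the top-2 experts per token, then softmax their scores.
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)     # (n_tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e    # tokens routed to expert e
                if mask.any():
                    weight = gates[mask][:, slot].unsqueeze(-1)
                    out[mask] += weight * expert(x[mask])
        return out


tokens = torch.randn(5, 64)                     # 5 token embeddings
print(MoELayer()(tokens).shape)                 # torch.Size([5, 64])
```

Because only 2 of the 8 experts run for any given token, the layer uses a fraction of its total parameters per token, which is the main efficiency argument for the MoE design.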

Figure 6.3: The Mixture of Experts architecture10
Mathematically, the MoE layer calculates, for each token, a softmax over the router scores of the top 2 of the 8 experts:

$$G(x) = \mathrm{Softmax}\big(\mathrm{TopK}(x \cdot W_g,\ 2)\big)$$

Where $W_g$ is the weight matrix for the “...