Self-attention
Self-attention was proposed by Cheng et al. in their 2016 paper Long Short-Term Memory-Networks for Machine Reading. The concept of self-attention builds upon the general idea of attention: it enables a model to learn the correlation between the current token (character, word, sentence, etc.) and its context window. In other words, it is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that same sequence. You can imagine this as a way of transforming word embeddings in the context of the sentence/sequence they appear in. The concept of self-attention as presented in the original paper is depicted in Figure 4.2.
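To make the idea of "relating every position of a sequence to every other position" concrete, here is a minimal sketch of self-attention in NumPy. Note that this is not the LSTM-based memory-network formulation from Cheng et al.'s paper; it uses the scaled dot-product form that later became standard, with randomly initialized projection matrices standing in for learned parameters purely for illustration.

```python
# A minimal sketch of self-attention (scaled dot-product form), not the
# memory-network formulation from Cheng et al.; projections are random
# stand-ins for learned weights.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Relate every position of X to every other position of X.

    X             : (seq_len, d_model) matrix of token embeddings.
    W_q, W_k, W_v : projection matrices (learned in a real model).
    Returns a new (seq_len, d_model) representation of the same sequence.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # project the sequence
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # context-aware embeddings

# Toy example: 4 tokens with 8-dimensional embeddings (random for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (4, 8)
```

Each row of the attention weights sums to one, so every output embedding is a weighted mixture of the value vectors of all positions in the same sequence, which is exactly the "transforming embeddings in context" intuition described above.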

Figure 4.2: Self-attention (source: Cheng et al.)
Let us try to understand the self-attention output presented in Figure 4.2. Each row shows the state of the model at a given time step, with the current word highlighted in red. Blue represents the attention of the...