Attention
A typical RNN layer (generally speaking, it could be an LSTM, GRU, etc.) takes a context window of a defined size as input and encodes all of it into a single vector (say, for the task of language modeling). This bottleneck vector needs to capture a lot of information before the decoding stage can use it to start generating the next token. This bottleneck led to a number of challenges across various NLP tasks, such as language translation, question-answering, and more.
Attention is one of the most powerful concepts in the deep learning space that really changed the game. The core idea behind the attention mechanism is to make use of all the interim hidden states of the RNN (as we’ll see, this extends to other architectures as well) and decide which ones to focus on at the decoding stage, rather than relying on a single bottleneck vector. A more formal way of presenting attention is:
Given a set of value vectors (all the hidden states of the RNN) and a query vector (this could be the decoder’s hidden state at a given time step), attention computes a weighted sum of the values, where the weights depend on how relevant each value is to the query.
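A minimal sketch of this idea in plain NumPy may help make it concrete. It uses simple dot-product scoring; the function name, toy shapes, and random inputs are purely illustrative, not a reference implementation from any particular library:

```python
import numpy as np

def dot_product_attention(query, values):
    """Weighted sum of `values`, weighted by similarity to `query`.

    query:  (d,)   e.g. the decoder hidden state at the current step
    values: (T, d) e.g. the encoder hidden states, one per source token
    """
    scores = values @ query                  # (T,) similarity of each value to the query
    weights = np.exp(scores - scores.max())  # softmax over the scores
    weights /= weights.sum()
    context = weights @ values               # (d,) attention-weighted sum of the values
    return context, weights

# Toy usage: 5 encoder hidden states and one decoder query, each of dimension 8
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))
decoder_state = rng.normal(size=(8,))
context, weights = dot_product_attention(decoder_state, encoder_states)
print(weights)   # attention distribution over the 5 source positions
print(context)   # context vector passed on to the decoder
```

The key point is that the weights form a probability distribution over all the hidden states, so the decoder can "look back" at whichever positions are most relevant at each step instead of depending on one fixed summary vector.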