DistilBERT in action
The transformer architecture ushered in previously unheard-of performance benchmarks in the NLP domain. One of the earliest and most successful transformer architectures was the BERT model. BERT, or Bidirectional Encoder Representations from Transformers, was presented by Devlin et al., a team at Google AI, in 2018.
BERT also helped push the transfer-learning envelope in the NLP domain by showcasing how a pretrained model can be fine-tuned for various tasks to provide state-of-the-art performance. BERT makes use of a transformer-style encoder with a different number of encoder blocks depending on the model size. The authors presented two models: BERT-base with 12 blocks and BERT-large with 24 blocks. Both models have larger hidden dimensions (768 and 1,024, respectively) and a greater number of attention heads (12 and 16, respectively) compared to the original transformer setup.
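As a quick sanity check on these numbers, the published configuration of each checkpoint can be inspected programmatically. The following is a minimal sketch assuming the Hugging Face transformers library is installed and the standard bert-base-uncased and bert-large-uncased checkpoints are used:

```python
from transformers import BertConfig

# Fetch and print the configuration of the two published BERT checkpoints
# (assumes the Hugging Face `transformers` library is installed and the
# checkpoint names below are available on the model hub).
for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = BertConfig.from_pretrained(name)
    print(
        f"{name}: {cfg.num_hidden_layers} encoder blocks, "
        f"hidden size {cfg.hidden_size}, "
        f"{cfg.num_attention_heads} attention heads, "
        f"feed-forward size {cfg.intermediate_size}"
    )
```

Running this should report 12 blocks, a hidden size of 768, and 12 heads for the base model, and 24 blocks, a hidden size of 1,024, and 16 heads for the large model.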
Another major change from the original transformer implementation...