Abstract
Recent advances in neural architectures, such as the Transformer, coupled with the emergence of large-scale pre-trained models such as BERT, have revolutionized the field of Natural Language Processing (NLP), pushing the state of the art for a number of NLP tasks. A rich family of variations of these models has been proposed, including RoBERTa, ALBERT, and XLNet, but fundamentally they all remain limited in their ability to model certain kinds of information, and they cannot cope with certain sources of information that were easy for the models that preceded them. Thus, here we aim to shed light on some important theoretical limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture. First, we demonstrate in practice, on two general types of tasks (segmentation and segment labeling) and on four datasets, that these limitations are indeed harmful and that addressing them, even in some very simple and naïve ways, can yield sizable improvements over vanilla RoBERTa and XLNet models. Then, we offer a more general discussion of desiderata for future additions to the Transformer architecture that would increase its expressiveness, which we hope could help in the design of the next generation of deep NLP architectures.
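To make the "vanilla" setup concrete, the sketch below shows what plain token-level segment labeling with RoBERTa looks like in practice: BIO tags are predicted independently for each token by a token-classification head, with no explicit modeling of segment structure. This is a minimal illustration using the Hugging Face transformers library, not the authors' experimental setup; the tag set, the example sentence, and the roberta-base checkpoint are assumptions, and the classification head here is randomly initialized, so it would need fine-tuning on a labeled dataset before its predictions mean anything.

```python
# Illustrative only: a vanilla RoBERTa token-classification pipeline for
# BIO-style segment labeling, using Hugging Face `transformers`.
# The tag set and example sentence are hypothetical, and the classification
# head is randomly initialized (it must be fine-tuned before use).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-SEG", "I-SEG"]  # hypothetical BIO tags for one segment type

# add_prefix_space=True is required by RoBERTa's BPE tokenizer when the
# input is already split into words.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS)
)

words = ["This", "claim", "is", "an", "obvious", "exaggeration", "."]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits[0]  # shape: (num_subword_tokens, num_labels)

# RoBERTa's subword tokenization may split one word into several pieces,
# so we map predictions back to words by keeping the first subword of each.
word_ids = enc.word_ids()
predictions, seen = [], set()
for idx, wid in enumerate(word_ids):
    if wid is None or wid in seen:  # skip special tokens and word continuations
        continue
    seen.add(wid)
    predictions.append(LABELS[int(logits[idx].argmax())])

print(list(zip(words, predictions)))
```

Each token's label is chosen independently by an argmax over the emission scores; one simple, naïve way to inject segment-level structure (not necessarily the one used in the paper) would be to replace this per-token argmax with structured decoding, for example a CRF layer over the emissions, as in the BERT-CRF approach of Souza et al. cited in the references.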
Notes
- 1. A notable previous promising attempt was ELMo [21], but it became largely outdated in less than a year.
- 4. The official task webpage: http://propaganda.qcri.org/semeval2020-task11/.
References
Arkhipov, M., Trofimova, M., Kuratov, Y., Sorokin, A.: Tuning multilingual transformers for language-specific named entity recognition. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2019), pp. 89–93. Florence, Italy (2019)
Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017), pp. 546–555. Vancouver, Canada (2017)
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. In: ArXiv (2020)
Choromanski, K., et al.: Rethinking attention with performers. In: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021) (2021)
Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does BERT look at? An analysis of BERT’s attention. In: ArXiv (2019)
Da San Martino, G., Barrón-Cedeño, A., Wachsmuth, H., Petrov, R., Nakov, P.: SemEval-2020 task 11: detection of propaganda techniques in news articles. In: Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval 2020), Barcelona, Spain (2020)
Da San Martino, G., Yu, S., Barrón-Cedeño, A., Petrov, R., Nakov, P.: Fine-grained analysis of propaganda in news article. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pp. 5636–5646. Hong Kong, China (2019)
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., Salakhutdinov, R.: Transformer-XL: attentive language models beyond a fixed-length context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pp. 2978–2988. Florence, Italy (2019)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pp. 4171–4186. Minneapolis, MN, USA (2019)
Durrani, N., Dalvi, F., Sajjad, H., Belinkov, Y., Nakov, P.: One size does not fit all: Comparing NMT representations of different granularities. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pp. 1504–1516. Minneapolis, MN, USA (2019)
Ettinger, A.: What BERT is not: lessons from a new suite of psycholinguistic diagnostics for language models. Trans. Assoc. Comput. Linguist. 8, 34–48 (2020)
Goldberg, Y.: Assessing BERT’s syntactic abilities. In: ArXiv (2019)
Jawahar, G., Sagot, B., Seddah, D.: What does BERT learn about the structure of language? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pp. 3651–3657. Florence, Italy (2019)
Jin, D., Jin, Z., Zhou, J.T., Szolovits, P.: Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020), pp. 8018–8025 (2020)
Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: fast autoregressive transformers with linear attention. In: Proceedings of the 37th International Conference on Machine Learning (ICML 2020), pp. 5156–5165 (2020)
Kovaleva, O., Romanov, A., Rogers, A., Rumshisky, A.: Revealing the dark secrets of BERT. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pp. 4365–4374. Hong Kong, China (2019)
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 282–289. Williamstown, MA, USA (2001)
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. In: ArXiv (2019)
Liu, N.F., Gardner, M., Belinkov, Y., Peters, M.E., Smith, N.A.: Linguistic knowledge and transferability of contextual representations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pp. 1073–1094. Minneapolis, MN, USA (2019)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. In: ArXiv (2019)
Peters, M., et al.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), pp. 2227–2237. New Orleans, LA, USA (2018)
Peters, M.E., et al.: Knowledge enhanced contextual word representations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pp. 43–54. Hong Kong, China (2019)
Popel, M., Bojar, O.: Training tips for the transformer model. Prague Bull. Math. Linguist. 110(1), 43–70 (2018)
Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
Ratinov, L.A., Roth, D.: Design challenges and misconceptions in named entity recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009), pp. 147–155. Boulder, CO, USA (2009)
Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2020)
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: ArXiv (2019)
Souza, F., Nogueira, R., Lotufo, R.: Portuguese named entity recognition using BERT-CRF. In: ArXiv (2019)
Sun, L., et al.: Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT. In: ArXiv (2020)
Tenney, I., et al.: What do you learn from context? Probing for sentence structure in contextualized word representations. In: ArXiv (2019)
Vaswani, A., et al.: Attention is all you need. In: ArXiv (2017)
Wallace, E., Wang, Y., Li, S., Singh, S., Gardner, M.: Do NLP models know numbers? Probing numeracy in embeddings. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pp. 5307–5315. Hong Kong, China (2019)
Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: self-attention with linear complexity. In: ArXiv (2020)
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS 2019), pp. 5753–5763 (2019)
Zaheer, M., et al.: Big Bird: transformers for longer sequences. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS 2020) (2020)
Acknowledgments
Anton Chernyavskiy and Dmitry Ilvovsky performed this research within the framework of the HSE University Basic Research Program.
Preslav Nakov contributed as part of the Tanbih mega-project (http://tanbih.qcri.org/), which is developed at the Qatar Computing Research Institute, HBKU, and aims to limit the impact of “fake news,” propaganda, and media bias by making users aware of what they are reading.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Chernyavskiy, A., Ilvovsky, D., Nakov, P. (2021). Transformers: “The End of History” for Natural Language Processing?. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_41
DOI: https://doi.org/10.1007/978-3-030-86523-8_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86522-1
Online ISBN: 978-3-030-86523-8
eBook Packages: Computer Science, Computer Science (R0)