Abstract
The academic literature of the social sciences records human civilization and studies human social problems. As this literature grows rapidly, researchers urgently need ways to quickly locate existing research on relevant issues. Previous studies, such as SciBERT, have shown that pre-training on domain-specific text improves performance on natural language processing tasks. However, no pre-trained language model for the social sciences has been available so far. In light of this, the present research proposes a pre-trained model based on abstracts published in Social Science Citation Index (SSCI) journals. The models, which are available on GitHub (https://github.com/S-T-Full-Text-Knowledge-Mining/SSCI-BERT), show excellent performance on discipline classification, abstract structure–function recognition, and named entity recognition tasks in the social sciences literature.
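For readers who want to try the released checkpoints, the snippet below is a minimal sketch (not the authors' code) of loading a BERT-style model with the HuggingFace Transformers library (Wolf et al., 2019) and querying it with a masked social-science sentence. The model path is a placeholder; the actual checkpoint names are listed in the GitHub repository above.

```python
# Minimal sketch: load a BERT-style checkpoint with HuggingFace Transformers
# and score candidate fillers for a masked token in a social-science sentence.
from transformers import pipeline

# Placeholder path: substitute one of the SsciBERT checkpoints from the
# GitHub repository; any local or Hub model with BERT masked-LM weights works.
MODEL_PATH = "path/to/ssci-bert"

fill_mask = pipeline("fill-mask", model=MODEL_PATH)
for pred in fill_mask("This study examines the [MASK] of social media on adolescents."):
    print(pred["token_str"], round(pred["score"], 3))
```

The same checkpoint could likewise be passed to AutoModelForSequenceClassification or AutoModelForTokenClassification to fine-tune for the discipline classification and named entity recognition tasks evaluated in the paper.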



References
Asada, M., Miwa, M., & Sasaki, Y. (2020). Using drug descriptions and molecular structures for drug–drug interaction extraction from literature. Bioinformatics, 37(12), 1739–1746. https://doi.org/10.1093/bioinformatics/btaa907
Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. Paper presented at the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong.
Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. Paper presented at the Neural Information Processing Systems 2000 (NIPS 2000), Denver, Colorado.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. https://doi.org/10.1162/tacl_a_00051
Brack, A., D’Souza, J., Hoppe, A., Auer, S., & Ewerth, R. (2020). Domain-independent extraction of scientific concepts from research articles. In G. Kazai & A. R. Fuhr (Eds.), Advances in information retrieval (pp. 251–266). Springer.
Cattan, A., Johnson, S., Weld, D., Dagan, I., Beltagy, I., Downey, D., & Hope, T. (2021). SciCo: Hierarchical cross-document coreference for scientific concepts. Paper presented at the 3rd Conference on Automated Knowledge Base Construction (AKBC 2021), Irvine.
Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The muppets straight out of law school. Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online.
Chen, S. F., Beeferman, D., & Rosenfeld, R. (1998). Evaluation metrics for language models. Paper presented at the DARPA Broadcast News Transcription and Understanding Workshop (pp. 2–8).
D’Souza, J., Auer, S., & Pedersen, T. (2021, August). SemEval-2021 Task 11: NLPContributionGraph—Structuring scholarly NLP contributions for a research knowledge graph. Paper presented at the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online.
D’Souza, J., Hoppe, A., Brack, A., Jaradeh, M. Y., Auer, S., & Ewerth, R. (2020, May). The STEM-ECR dataset: Grounding scientific entity references in STEM scholarly content to authoritative encyclopedic and lexicographic sources. Paper presented at the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Paper presented at the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, Minnesota.
Dong, Q., Wan, X., & Cao, Y. (2021, April). ParaSCI: A Large scientific paraphrase dataset for longer paraphrase generation. Paper presented at the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021), Online.
Ferreira, D., & Freitas, A. (2020, May). Natural language premise selection: Finding supporting statements for mathematical text. Paper presented at the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille.
Friedrich, A., Adel, H., Tomazic, F., Hingerl, J., Benteau, R., Marusczyk, A., & Lange, L. (2020). The SOFC-Exp corpus and neural approaches to information extraction in the materials science domain. Paper presented at the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online.
Graetz, N. (1982). Teaching EFL students to extract structural information from abstracts. Paper presented at the International Symposium on Language for Special Purposes, Eindhoven.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Paper presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, Nevada.
Hebbar, S., & Xie, Y. (2021). CovidBERT: Biomedical relation extraction for COVID-19. Paper presented at the Florida Artificial Intelligence Research Society Conference, North Miami Beach, Florida.
Huang, K.-H., Yang, M., & Peng, N. (2020). Biomedical event extraction with hierarchical knowledge graphs. Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online.
Kononova, O., He, T., Huo, H., Trewartha, A., Olivetti, E. A., & Ceder, G. (2021). Opportunities and challenges of text mining in materials research. iScience, 24(3), 102155. https://doi.org/10.1016/j.isci.2021.102155
Kotonya, N., & Toni, F. (2020). Explainable automated fact-checking for public health claims. Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online.
Koutsikakis, J., Chalkidis, I., Malakasiotis, P., & Androutsopoulos, I. (2020). GREEK-BERT: The Greeks visiting Sesame Street. Paper presented at the 11th Hellenic Conference on Artificial Intelligence (SETN 2020), Athens.
Kuniyoshi, F., Makino, K., Ozawa, J., & Miwa, M. (2020). Annotating and extracting synthesis process of all-solid-state batteries from scientific literature. Paper presented at the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille.
Lauscher, A., Ko, B., Kuehl, B., Johnson, S., Jurgens, D., Cohan, A., & Lo, K. (2021). MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting. Preprint at https://arxiv.org/abs/2107.00414
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240. https://doi.org/10.1093/bioinformatics/btz682
Medić, Z., & Šnajder, J. (2020). A survey of citation recommendation tasks and methods. Journal of Computing and Information Technology, 28(3), 183–205. https://doi.org/10.20532/cit.2020.1005160
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781
Muraina, I. (2022). Ideal dataset splitting ratios in machine learning algorithms: General concerns for data scientists and data analysts.
Murty, S., Koh, P. W., & Liang, P. (2020, July). ExpBERT: Representation engineering with natural language explanations. Paper presented at the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online.
Nicholson, J. M., Mordaunt, M., Lopez, P., Uppala, A., Rosati, D., Rodrigues, N. P., Grabitz, P., & Rife, S. C. (2021). Scite: A smart citation index that displays the context of citations and classifies their intent using deep learning. Quantitative Science Studies, 2(3), 882–898. https://doi.org/10.1162/qss_a_00146
Park, S., & Caragea, C. (2020). Scientific keyphrase identification and classification by pre-trained language models intermediate task transfer learning. Paper presented at the 28th International Conference on Computational Linguistics (COLING 2020), Barcelona (Online).
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. Paper presented at the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Paper presented at the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), New Orleans, Louisiana.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. Retrieved from https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
Rasmy, L., Xiang, Y., Xie, Z., Tao, C., & Zhi, D. (2021). Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digital Medicine, 4(1), 86. https://doi.org/10.1038/s41746-021-00455-y
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. https://doi.org/10.48550/arXiv.1409.1556
Sollaci, L. B., & Pereira, M. G. (2004). The introduction, methods, results, and discussion (IMRAD) structure: A fifty-year survey. Journal of the Medical Library Association, 92(3), 364–367.
Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge University Press.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Paper presented at the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California.
van Dongen, T., Maillette de Buy Wenniger, G., & Schomaker, L. (2020, November). SChuBERT: Scholarly document chunks with BERT-encoding boost citation count prediction. Paper presented at the 1st Workshop on Scholarly Document Processing (SDP 2020), Online.
Viswanathan, V., Neubig, G., & Liu, P. (2021, August). CitationIE: Leveraging the citation graph for scientific information extraction. Paper presented at the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Online.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., … Rush, A. M. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. Preprint at https://arxiv.org/abs/1910.03771
Wright, D., & Augenstein, I. (2021). CiteWorth: Cite-worthiness detection for improved scientific document understanding. Paper presented at the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Online.
Yang, Y., Uy, M. C. S., & Huang, A. (2020). FinBERT: A pretrained language model for financial communications. https://doi.org/10.48550/arXiv.2006.08097
Acknowledgements
The authors acknowledge the National Natural Science Foundation of China (Grant Numbers: 71974094, 72004169) for financial support and the data annotation team of Nanjing Agricultural University and Nanjing University of Science and Technology. We also thank the students and researchers who helped revise and polish the paper.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest related to this work.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shen, S., Liu, J., Lin, L. et al. SsciBERT: a pre-trained language model for social science texts. Scientometrics 128, 1241–1263 (2023). https://doi.org/10.1007/s11192-022-04602-4