Abstract
The academic literature of the social sciences records human civilization and studies human social problems. As this literature grows rapidly, researchers urgently need ways to quickly locate existing research on relevant issues. Previous studies, such as SciBERT, have shown that pre-training on domain-specific text improves performance on natural language processing tasks. However, no pre-trained language model for the social sciences has been available so far. In light of this, the present research proposes a pre-trained model based on abstracts published in Social Science Citation Index (SSCI) journals. The models, which are available on GitHub (https://github.com/S-T-Full-Text-Knowledge-Mining/SSCI-BERT), show excellent performance on discipline classification, abstract structure–function recognition, and named entity recognition tasks in the social sciences literature.
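For readers who want to try the released checkpoints, the snippet below is a minimal sketch (not the authors' code) of loading a BERT-style model with the HuggingFace Transformers library (Wolf et al., 2019) and querying it with a masked social-science sentence. The model path is a placeholder; the actual checkpoint names are listed in the GitHub repository above.

```python
# Minimal sketch: load a BERT-style checkpoint with HuggingFace Transformers
# and score candidate fillers for a masked token in a social-science sentence.
from transformers import pipeline

# Placeholder path: substitute one of the SsciBERT checkpoints from the
# GitHub repository; any local or Hub model with BERT masked-LM weights works.
MODEL_PATH = "path/to/ssci-bert"

fill_mask = pipeline("fill-mask", model=MODEL_PATH)
for pred in fill_mask("This study examines the [MASK] of social media on adolescents."):
    print(pred["token_str"], round(pred["score"], 3))
```

The same checkpoint could likewise be passed to AutoModelForSequenceClassification or AutoModelForTokenClassification to fine-tune for the discipline classification and named entity recognition tasks evaluated in the paper.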



References
Asada, M., Miwa, M., & Sasaki, Y. (2020). Using drug descriptions and molecular structures for drug–drug interaction extraction from literature. Bioinformatics, 37(12), 1739–1746. https://doi.org/10.1093/bioinformatics/btaa907
Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. Paper presented at the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong.
Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. Paper presented at the Neural Information Processing Systems 2000 (NIPS 2000), Denver, Colorado.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. https://doi.org/10.1162/tacl_a_00051
Brack, A., D’Souza, J., Hoppe, A., Auer, S., & Ewerth, R. (2020). Domain-independent extraction of scientific concepts from research articles. In G. Kazai & A. R. Fuhr (Eds.), Advances in information retrieval (pp. 251–266). Springer.
Cattan, A., Johnson, S., Weld, D., Dagan, I., Beltagy, I., Downey, D., & Hope, T. (2021). SciCo: Hierarchical cross-document coreference for scientific concepts. Paper presented at the 3rd Conference on Automated Knowledge Base Construction (AKBC 2021), Irvine.
Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The muppets straight out of law school. Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online.
Chen, S. F., Beeferman, D., & Rosenfeld, R. (1998). Evaluation metrics for language models. Paper presented at the DARPA Broadcast News Transcription and Understanding Workshop (pp. 2–8).
D’Souza, J., Auer, S., & Pedersen, T. (2021, August). SemEval-2021 Task 11: NLPContributionGraph—Structuring scholarly NLP contributions for a research knowledge graph. Paper presented at the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online.
D’Souza, J., Hoppe, A., Brack, A., Jaradeh, M. Y., Auer, S., & Ewerth, R. (2020, May). The STEM-ECR dataset: Grounding scientific entity references in STEM scholarly content to authoritative encyclopedic and lexicographic sources. Paper presented at the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Paper presented at the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, Minnesota.
Dong, Q., Wan, X., & Cao, Y. (2021, April). ParaSCI: A Large scientific paraphrase dataset for longer paraphrase generation. Paper presented at the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021), Online.
Ferreira, D., & Freitas, A. (2020, May). Natural language premise selection: Finding supporting statements for mathematical text. Paper presented at the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille.
Friedrich, A., Adel, H., Tomazic, F., Hingerl, J., Benteau, R., Marusczyk, A., & Lange, L. (2020). The SOFC-Exp corpus and neural approaches to information extraction in the materials science domain. Paper presented at the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online.
Graetz, N. (1982). Teaching EFL students to extract structural information from abstracts. Paper presented at the International Symposium on Language for Special Purposes, Eindhoven.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Paper presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, Nevada.
Hebbar, S., & Xie, Y. (2021). CovidBERT: Biomedical relation extraction for COVID-19. Paper presented at the Florida Artificial Intelligence Research Society Conference, North Miami Beach, Florida.
Huang, K.-H., Yang, M., & Peng, N. (2020). Biomedical event extraction with hierarchical knowledge graphs. Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online.
Kononova, O., He, T., Huo, H., Trewartha, A., Olivetti, E. A., & Ceder, G. (2021). Opportunities and challenges of text mining in materials research. iScience, 24(3), 102155. https://doi.org/10.1016/j.isci.2021.102155
Kotonya, N., & Toni, F. (2020). Explainable automated fact-checking for public health claims. Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online.
Koutsikakis, J., Chalkidis, I., Malakasiotis, P., & Androutsopoulos, I. (2020). GREEK-BERT: The Greeks visiting Sesame Street. Paper presented at the 11th Hellenic Conference on Artificial Intelligence (SETN 2020), Athens.
Kuniyoshi, F., Makino, K., Ozawa, J., & Miwa, M. (2020). Annotating and extracting synthesis process of all-solid-state batteries from scientific literature. Paper presented at the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille.
Lauscher, A., Ko, B., Kuehl, B., Johnson, S., Jurgens, D., Cohan, A., & Lo, K. (2021). MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting. Preprint at https://arxiv.org/abs/2107.00414
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240. https://doi.org/10.1093/bioinformatics/btz682
Medić, Z., & Šnajder, J. (2020). A survey of citation recommendation tasks and methods. Journal of Computing and Information Technology, 28(3), 183–205. https://doi.org/10.20532/cit.2020.1005160
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781
Muraina, I. (2022). Ideal dataset splitting ratios in machine learning algorithms: General concerns for data scientists and data analysts.
Murty, S., Koh, P. W., & Liang, P. (2020, July). ExpBERT: Representation engineering with natural language explanations. Paper presented at the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online.
Nicholson, J. M., Mordaunt, M., Lopez, P., Uppala, A., Rosati, D., Rodrigues, N. P., Grabitz, P., & Rife, S. C. (2021). Scite: A smart citation index that displays the context of citations and classifies their intent using deep learning. Quantitative Science Studies, 2(3), 882–898. https://doi.org/10.1162/qss_a_00146
Park, S., & Caragea, C. (2020). Scientific keyphrase identification and classification by pre-trained language models intermediate task transfer learning. Paper presented at the 28th International Conference on Computational Linguistics (COLING 2020), Barcelona (Online).
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. Paper presented at the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Paper presented at the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), New Orleans, Louisiana.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. Retrieved from https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
Rasmy, L., Xiang, Y., Xie, Z., Tao, C., & Zhi, D. (2021). Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digital Medicine, 4(1), 86. https://doi.org/10.1038/s41746-021-00455-y
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. https://doi.org/10.48550/arXiv.1409.1556
Sollaci, L. B., & Pereira, M. G. (2004). The introduction, methods, results, and discussion (IMRAD) structure: A fifty-year survey. Journal of the Medical Library Association, 92(3), 364–367.
Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge University Press.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Paper presented at the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California.
van Dongen, T., Maillette de Buy Wenniger, G., & Schomaker, L. (2020, November). SChuBERT: Scholarly document chunks with BERT-encoding boost citation count prediction. Paper presented at the 1st Workshop on Scholarly Document Processing (SDP 2020), Online.
Viswanathan, V., Neubig, G., & Liu, P. (2021, August). CitationIE: Leveraging the citation graph for scientific information extraction. Paper presented at the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Online.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., … Rush, A. M. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. Preprint at https://arxiv.org/abs/1910.03771
Wright, D., & Augenstein, I. (2021). CiteWorth: Cite-worthiness detection for improved scientific document understanding. Paper presented at the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Online.
Yang, Y., Uy, M. C. S., & Huang, A. (2020). FinBERT: A pretrained language model for financial communications. https://doi.org/10.48550/arXiv.2006.08097
Acknowledgements
The authors acknowledge the National Natural Science Foundation of China (Grant Numbers: 71974094, 72004169) for financial support and the data annotation team of Nanjing Agricultural University and Nanjing University of Science and Technology. We also thank the students and researchers who helped revise and polish the paper.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest related to this work.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shen, S., Liu, J., Lin, L. et al. SsciBERT: a pre-trained language model for social science texts. Scientometrics 128, 1241–1263 (2023). https://doi.org/10.1007/s11192-022-04602-4