Abstract
Online social media yields large-scale corpora that are informative and often include many up-to-date entities. The challenge of entity set expansion on social media text is to extract previously unseen entities given only a few seeds. In this paper, we present a novel approach that discovers newly-presented objects by performing entity set expansion on social media. Starting from an initial seed set, our method first uses an embedding method to obtain a semantic similarity feature when generating candidate lists, and then extracts connective-pattern and prefix-rule features that reflect the specific nature of social media. A ranking model is then learned with a supervised algorithm to score each candidate term jointly on these features and produce the final ranked set. Experimental results on a Twitter text corpus show that our solution achieves high precision both on common class sets and on new class sets that contain abundant informal and new entities not mentioned in common articles.
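The candidate-generation step can be illustrated with a minimal sketch. The abstract does not specify the embedding model or the scoring formula, so everything below is an assumption made for illustration: the function names `cosine` and `rank_candidates`, the choice of mean cosine similarity to the seeds as the score, and the three-dimensional toy vectors, which stand in for embeddings trained on a real tweet corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(seeds, embeddings, top_k=3):
    """Score every non-seed term by its mean cosine similarity to the seeds.

    Hypothetical helper: the paper's actual candidate scoring may differ.
    """
    seed_vecs = [embeddings[s] for s in seeds if s in embeddings]
    if not seed_vecs:
        return []
    scored = []
    for term, vec in embeddings.items():
        if term in seeds:
            continue  # seeds are already known members of the set
        score = sum(cosine(vec, s) for s in seed_vecs) / len(seed_vecs)
        scored.append((term, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors invented for this example, not taken from any trained model.
TOY_EMBEDDINGS = {
    "snapchat": [0.9, 0.1, 0.0],
    "instagram": [0.8, 0.2, 0.1],
    "tiktok": [0.85, 0.15, 0.05],
    "pizza": [0.0, 0.1, 0.9],
    "monday": [0.1, 0.9, 0.1],
}

ranked = rank_candidates(["snapchat", "instagram"], TOY_EMBEDDINGS)
for term, score in ranked:
    print(f"{term}\t{score:.3f}")
```

In the approach described above, a similarity score of this kind would be only one feature; the connective-pattern and prefix-rule features would be computed separately and combined by the learned ranking model.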
Acknowledgements
This work was supported by the National High-tech Research and Development Program (863 Program) (No. 2014AA015105) and the National Natural Science Foundation of China (No. 61602490).
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhao, H., Feng, C., Luo, Z., Pei, Y. (2017). Entity Set Expansion on Social Media: A Study for Newly-Presented Entity Classes. In: Cheng, X., Ma, W., Liu, H., Shen, H., Feng, S., Xie, X. (eds) Social Media Processing. SMP 2017. Communications in Computer and Information Science, vol 774. Springer, Singapore. https://doi.org/10.1007/978-981-10-6805-8_10
DOI: https://doi.org/10.1007/978-981-10-6805-8_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6804-1
Online ISBN: 978-981-10-6805-8
eBook Packages: Computer Science, Computer Science (R0)