Abstract
Keyphrase extraction from social media is a crucial and challenging task. Previous studies usually focus on extracting keyphrases that summarize a corpus, but they do not take users’ specific needs into consideration. In this paper, we propose a novel three-stage model to learn a keyphrase set that represents or is related to a particular topic. First, a phrase mining algorithm is applied to segment the documents into human-interpretable phrases. Second, we propose a weakly supervised model to extract candidate keyphrases, guided by a few pre-specified seed keyphrases that reflect the user’s needs. This guidance makes the extracted keyphrases more specific and more closely related to the seed keyphrases. Finally, to identify implicitly related phrases, the PMI-IR algorithm is employed to obtain synonyms of the extracted candidate keyphrases. We conducted experiments on two publicly available datasets, one from news and one from Twitter. The experimental results demonstrate that our approach outperforms the state-of-the-art baselines and has the potential to extract high-quality task-oriented keyphrases.
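
To make the final stage more concrete, the following is a minimal sketch of pointwise mutual information (PMI) between phrases, the quantity that PMI-IR is built on. It is not the authors' implementation. PMI-IR as introduced by Turney estimates the probabilities from search-engine hit counts; this sketch assumes document-level co-occurrence counts in a small local corpus as a stand-in. The function names `pmi_scores` and `related_phrases`, and the toy phrases, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Compute PMI for every phrase pair that co-occurs in at least one document.

    `documents` is a list of phrase lists (assumed to be already segmented).
    Probabilities are estimated from document frequencies, which is an
    assumption of this sketch rather than the search-hit counts used by PMI-IR.
    """
    n_docs = len(documents)
    phrase_df = Counter()  # number of documents containing each phrase
    pair_df = Counter()    # number of documents containing both phrases
    for doc in documents:
        phrases = sorted(set(doc))  # sort so each pair has one canonical order
        phrase_df.update(phrases)
        pair_df.update(combinations(phrases, 2))

    scores = {}
    for (a, b), n_ab in pair_df.items():
        p_ab = n_ab / n_docs
        p_a = phrase_df[a] / n_docs
        p_b = phrase_df[b] / n_docs
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def related_phrases(candidate, scores, top_k=3):
    """Rank the phrases that co-occur with `candidate` by descending PMI."""
    ranked = []
    for (a, b), score in scores.items():
        if a == candidate:
            ranked.append((b, score))
        elif b == candidate:
            ranked.append((a, score))
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Toy corpus of already-segmented documents (hypothetical data).
docs = [
    ["gun control", "second amendment", "senate"],
    ["gun control", "second amendment"],
    ["senate", "budget"],
    ["budget", "tax cut"],
]
scores = pmi_scores(docs)
print(related_phrases("gun control", scores))
```

In this toy corpus, "second amendment" ranks above "senate" for the candidate "gun control" because the first pair co-occurs more often than their individual frequencies would predict. A real system would need smoothing for rare phrases and a threshold on the score before accepting a phrase as related.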

Notes
Available at https://www.google.com/advanced_search.
Available at http://qwone.com/~jason/20Newsgroups.
Available at http://www.nltk.org.
Available at http://www.ranks.nl/stopwords.
Available at http://wordnet.princeton.edu/.
Cite this article
Yang, M., Liang, Y., Zhao, W. et al. Task-oriented keyphrase extraction from social media. Multimed Tools Appl 77, 3171–3187 (2018). https://doi.org/10.1007/s11042-017-5041-y
DOI: https://doi.org/10.1007/s11042-017-5041-y