Abstract
Recently, supervised topic modeling approaches have received considerable attention. However, the representative labeled latent Dirichlet allocation (L-LDA) method has a tendency to over-focus on the pre-assigned labels, and does not give potentially lost labels and common semantics sufficient consideration. To overcome these problems, we propose an extension of L-LDA, namely supervised labeled latent Dirichlet allocation (SL-LDA), for document categorization. Our model makes two fundamental assumptions, i.e., Prior 1 and Prior 2, that relax the restriction of label sampling and extend the concept of topics. In this paper, we develop a Gibbs expectation-maximization algorithm to learn the SL-LDA model. Quantitative experimental results demonstrate that SL-LDA is competitive with state-of-the-art approaches on both single-label and multi-label corpora.






Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
In fact, \(Dir^{\prime }(\overrightarrow \alpha ,{y_{d}})\) is a pseudo-PDF, because \({\int }_{- \infty }^{\infty } Dir^{\prime }(\overrightarrow \alpha ,{y_{d}}) = \) 1−𝜃 ∗. Here, it is used to sample the absence topic components in \(\overrightarrow {{{\Lambda }_{d}}}\).
Results for the Yahoo! Health collection are very similar.
We set K = 100 for Yahoo! Arts and K = 240 for RCV1-v2.
References
Ali D, Faqir M (2012) Group topic modeling for academic knowledge discovery. Appl. Intell. 36(4):870–886
Andrieu C, Freitas ND, Doucet A, Jordan MI (2003) An introduction to MCMC for machine learning. Mach Learn 50(1):5–43
Blei DM, Lafferty JD (2007) A correlated topic model fo science. Ann Appl Stat 1(1):17–35
Blei DM, McAuliffe JD (2007) Supervised topic models. In: Neural information processing systems
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition , vol 2, pp 524–531
Heinrich G. (2005) Parameter estimation for text analysis. http://www.arbylon.net/publications/textest
Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp 50–57
Jaegul C, Changhyun L, Chandan KR, Park H (2013) Utopian: user-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Trans Vis Comput Graph 19(12):1992–2001
Ji S, Tang L, Yu S, Ye J (2008) Extracting shared subspace for multi-label classification. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 381–389
Kim D, Kim S, Oh A (2012) Dirichlet process with mixed random measures: a nonparametric topic model for labeled data. In: 29th International conference on machine learning, pp 727–734
Lacoste-Julien S, Sha F, Jordan MI (2009) Disclda: discriminative learning for dimensionality reduction and classification. In: Neural information processing systems, pp 897–904
Lewis DD, andTony G, Rose YY, Li F (2004) Rcv1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
Quelhas P, Monay F, Odobez JM, Gatica-Perez D, Tuytelaars T, Van Gool L (2005) Modeling scenes with local descriptors and latent aspects. Comput Vis IEEE Int Conf 1:883–890
Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Conference on empirical methods in natural language processing, pp 248–256. Association for Computational Linguistics
Ramage D, Manning CD, Dumais S (2011) Partially labeled topic models for interpretable text mining. In: ACM SIGKDD international conference on knowledge discovery and data mining, pp 457–465
Rubin TN, Chambers A, Smyth P, Steyvers M (2012) Statistical topic models for multi-label document classification. Mach Learn 88(1–2):157–208
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47
Wallach H (2006) Topic modeling: beyond bag-of-words. In: Proceedings of the 23rd international conference on Machine learning, pp 977–984. ACM
Xie P, Xing EP (2013) Integrating document clustering and topic modeling. In: Proceedings of the 20th conference on uncertainty in artificial intelligence, pp 694–703
Xu Y, Guo R (2014) An inproved nu-twin support vector machine. Appl Intell 41(1):42–54
Zhang ML, Zhang K (2010) Multi-label learning by exploiting label dependency. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, pp 999–1008
Zhu J, Ahmed A, Xing E (2009) Medlda: maximum margin supervised topic models for regression and classification. In: Proceedings of the 26th annual international conference on machine learning, pp 1257–1264. ACM
Zhu J, Ahmed A, Xing E. (2012) Medlda: maximum margin supervised topic models
Acknowledgments
This work was supported by National Nature Science Foundation of China (NSFC) under the Grant No. 61170092, 61133011, and 61103091.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Li, X., Ouyang, J., Zhou, X. et al. Supervised labeled latent Dirichlet allocation for document categorization. Appl Intell 42, 581–593 (2015). https://doi.org/10.1007/s10489-014-0595-0
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-014-0595-0