Abstract
Cross language text categorization is the task of exploiting labelled documents in a source language (e.g. English) to classify documents in a target language (e.g. Chinese). In this paper, we investigate the use of a bilingual lexicon for this task and propose a novel two-stage refinement framework. In the first stage, a cross language model transfer generates initial labels for the documents in the target language. In the second stage, an expectation maximization algorithm based on the naive Bayes model refines these initial labels to yield the final labels. Preliminary experimental results on collected corpora show that the proposed framework is effective.
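To make the two stages concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a multinomial naive Bayes model with additive smoothing, and it assumes that the model transfer projects source-language word counts onto target-language words by splitting each count uniformly over the lexicon translations. The function names (`estimate`, `transfer_model`, `em_refine`), the smoothing constant `alpha`, and the toy lexicon and documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def estimate(class_weight, counts, classes, vocab, alpha=0.1):
    """Build a naive Bayes model (log priors, log likelihoods) from
    possibly fractional counts, with additive smoothing (assumed alpha)."""
    total = sum(class_weight[c] + alpha for c in classes)
    log_prior = {c: math.log((class_weight[c] + alpha) / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(counts[c].get(t, 0.0) for t in vocab) + alpha * len(vocab)
        log_like[c] = {t: math.log((counts[c].get(t, 0.0) + alpha) / denom)
                       for t in vocab}
    return log_prior, log_like


def posterior(doc, model, classes):
    """P(class | doc) under the model; words outside the vocabulary are skipped."""
    log_prior, log_like = model
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
              for c in classes}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def transfer_model(src_docs, src_labels, lexicon, classes, tgt_vocab):
    """Stage 1 (assumed form): project source word counts through the lexicon,
    splitting each count uniformly over the available translations."""
    counts = {c: defaultdict(float) for c in classes}
    for doc, c in zip(src_docs, src_labels):
        for s in doc:
            translations = [t for t in lexicon.get(s, []) if t in tgt_vocab]
            for t in translations:
                counts[c][t] += 1.0 / len(translations)
    return estimate(Counter(src_labels), counts, classes, tgt_vocab)


def em_refine(tgt_docs, model, classes, vocab, iterations=5):
    """Stage 2: EM over the unlabelled target documents."""
    for _ in range(iterations):
        # E-step: soft labels for every target document.
        posts = [posterior(d, model, classes) for d in tgt_docs]
        # M-step: re-estimate the model from the soft labels.
        weight = {c: sum(p[c] for p in posts) for c in classes}
        counts = {c: defaultdict(float) for c in classes}
        for doc, p in zip(tgt_docs, posts):
            for t in doc:
                for c in classes:
                    counts[c][t] += p[c]
        model = estimate(weight, counts, classes, vocab)
    return model


def predict(docs, model, classes):
    """Most probable class for each document."""
    labels = []
    for d in docs:
        p = posterior(d, model, classes)
        labels.append(max(p, key=p.get))
    return labels


# Toy data (hypothetical): English source documents, Chinese target documents.
classes = ["sports", "finance"]
src_docs = [["football", "match"], ["match", "football", "football"],
            ["bank", "stock"], ["stock", "bank", "bank"]]
src_labels = ["sports", "sports", "finance", "finance"]
lexicon = {"football": ["足球"], "match": ["比赛"], "bank": ["银行"], "stock": ["股票"]}
tgt_docs = [["足球", "球员"], ["球员", "比赛"], ["银行", "利率"], ["利率", "股票"]]
tgt_vocab = {t for d in tgt_docs for t in d}

initial_model = transfer_model(src_docs, src_labels, lexicon, classes, tgt_vocab)
initial = predict(tgt_docs, initial_model, classes)
refined_model = em_refine(tgt_docs, initial_model, classes, tgt_vocab)
final = predict(tgt_docs, refined_model, classes)
print(initial, final)
```

In this toy setup the target words 球员 and 利率 have no lexicon entry, so the transferred model knows nothing about them. The EM stage can still associate them with a class because they co-occur with translated words in the target documents, which is the intuition behind refining the initial labels.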
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wu, K., Lu, BL. (2008). A Refinement Framework for Cross Language Text Categorization. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_39
DOI: https://doi.org/10.1007/978-3-540-68636-1_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)