Abstract
Web content mining is an important means of collecting and analyzing information on the Internet. Because most web pages are non-XML documents, extracting useful information efficiently from massive numbers of web pages remains an interesting research topic. Based on an analysis of the features of web content mining, an XML-based web content mining method is proposed. The method first identifies authoritative web pages using the HITS algorithm, then cleans and extracts the data with HTML Tidy and transforms the non-XML documents into structured XML documents, and finally mines the XML documents using text clustering techniques. A scientific paper web site is chosen as a case study for web content extraction. Experimental results show that the proposed method works well and can extract web content efficiently and automatically.
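As an illustration of the first step, the following is a minimal sketch of HITS hub and authority scoring by power iteration. It is not the authors' implementation: the function name `hits`, the dictionary-based link graph, the fixed iteration count, and the toy pages are all assumptions made for this example.

```python
from math import sqrt


def hits(links, iterations=50):
    """Compute HITS scores for a link graph.

    links: dict mapping each page to the list of pages it links to
    (a hypothetical representation chosen for this sketch).
    Returns (hub, authority) dicts with L2-normalized scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub update: sum the authority scores of pages linked to.
        new_hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth


# Toy graph (assumed): pages "a" and "b" both link to "c"; "c" links to "d".
example_links = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub, auth = hits(example_links)
best_authority = max(auth, key=auth.get)
```

In this toy graph the page with the most incoming links from good hubs receives the highest authority score; a real system would build `links` from crawled hyperlinks before passing the top-ranked pages on to the cleaning and XML conversion steps.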
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, J., Chen, H., Guo, J. (2010). Study on Method of Web Content Mining for Non-XML Documents. In: Zhu, R., Zhang, Y., Liu, B., Liu, C. (eds) Information Computing and Applications. ICICA 2010. Communications in Computer and Information Science, vol 106. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16339-5_31
DOI: https://doi.org/10.1007/978-3-642-16339-5_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16338-8
Online ISBN: 978-3-642-16339-5
eBook Packages: Computer Science, Computer Science (R0)