Study on Method of Web Content Mining for Non-XML Documents

Chen, Jianguo; Chen, Hao; Guo, Jie

doi:10.1007/978-3-642-16339-5_31

Part of the book series: Communications in Computer and Information Science (CCIS, volume 106)

Included in the following conference series:

International Conference on Information Computing and Applications

Abstract

Web content mining is an important means of collecting and analyzing Internet information, but most web pages are non-XML documents, so extracting useful information efficiently from massive numbers of web pages is an interesting research topic. Based on an analysis of the features of web content mining, an XML-based web content mining method is proposed. First, it identifies authority web pages using the HITS algorithm; then it transforms the non-XML documents into structured XML documents after data cleaning and extraction with HTML Tidy; finally, it mines the XML documents using text clustering techniques. A scientific paper web site is chosen as a case study for web content extraction. Experimental results show that the proposed method works well and can extract web content efficiently and automatically.
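The first step named in the abstract, selecting authority pages with HITS, can be illustrated with a minimal sketch. This is not the authors' implementation: the link graph, the page names, the function name `hits`, and the fixed iteration count are all assumptions made for illustration. It shows only the standard mutual-reinforcement update between hub and authority scores.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a link graph.

    `links` maps each page to the list of pages it links to.
    Hypothetical helper for illustration, not taken from the paper.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]

        # Hub update: sum the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}

        # Normalize both vectors (L2 norm) so scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    return hub, auth


# Assumed toy graph: two listing pages pointing at two paper pages.
links = {
    "index": ["paper_a", "paper_b"],
    "list": ["paper_a"],
    "paper_a": [],
    "paper_b": [],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph `paper_a` receives links from both listing pages, so it ends up with the highest authority score, while `index` links to both papers and ends up as the strongest hub. In a pipeline like the one the abstract describes, pages ranked this way would then be passed on to the cleaning and XML conversion step.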

Author information

Authors and Affiliations

Software School, Hunan University, Changsha, 410082, China
Jianguo Chen, Hao Chen & Jie Guo
Software College, Fujian University of Technology, Fujian, 350003, China
Jianguo Chen
School of Information Science and Engineering, Central South University, Changsha
Hao Chen

Editor information

Editors and Affiliations

College of Computer Science, South-Central University for Nationalities, 708 Minyuan Road, 430074, Wuhan, China
Rongbo Zhu
8001, Melbourne, VIC, Australia
Yanchun Zhang
College of Sciences, Hebei Polytechnic University, 063000, Tangshan, Hebei, China
Baoxiang Liu & Chunfeng Liu

About this paper

Cite this paper

Chen, J., Chen, H., Guo, J. (2010). Study on Method of Web Content Mining for Non-XML Documents. In: Zhu, R., Zhang, Y., Liu, B., Liu, C. (eds) Information Computing and Applications. ICICA 2010. Communications in Computer and Information Science, vol 106. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16339-5_31

DOI: https://doi.org/10.1007/978-3-642-16339-5_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16338-8
Online ISBN: 978-3-642-16339-5
eBook Packages: Computer Science, Computer Science (R0)
