Wrapper Maintenance for Web-Data Extraction Based on Pages Features

Zhou, Shunxian; Lin, Yaping; Wang, Jingpu; Yang, Xiaolin

doi:10.1007/3-540-33521-8_31

Part of the book series: Advances in Soft Computing (AINSC, volume 35)

Abstract

Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. There are two main issues relevant to Web-data extraction, namely wrapper generation and wrapper maintenance. In this paper, we propose a novel approach to automatic wrapper maintenance. It is based on the observation that despite various page changes, many important features of the pages are preserved, such as text pattern features, annotations, and hyperlinks. Our approach uses these preserved features to identify the locations of the desired values in the changed pages, and repairs wrappers correspondingly. Experiments over several real-world Web sites show that the proposed automatic approach can effectively maintain wrappers to extract desired data with high accuracy.
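
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, which this page does not describe in detail. It assumes a wrapper is a simple (left, right) delimiter pair, that the preserved annotation is a label such as "Price:", and that the text pattern feature is a regular expression. All function names, the sample pages, and the delimiter-based wrapper model are hypothetical.

```python
import re


def extract(page, wrapper):
    """Apply a (left, right) delimiter wrapper; return None if it fails."""
    left, right = wrapper
    i = page.find(left)
    if i == -1:
        return None
    i += len(left)
    j = page.find(right, i)
    return page[i:j] if j != -1 else None


def locate_by_features(page, annotation, value_pattern):
    """Find the first value matching value_pattern after the annotation.

    Returns a (start, end) span, or None if either feature is missing.
    """
    label = page.find(annotation)
    if label == -1:
        return None
    match = re.compile(value_pattern).search(page, label + len(annotation))
    if match is None:
        return None
    return match.start(), match.end()


def repair_wrapper(page, span):
    """Re-learn delimiters from the tags directly around the located value.

    Assumption: the value sits immediately between an opening and a closing tag.
    """
    start, end = span
    left_at = page.rfind("<", 0, start)
    right_at = page.find(">", end)
    if left_at == -1 or right_at == -1:
        return None
    return page[left_at:start], page[end:right_at + 1]


# Hypothetical pages: the layout changed, but the label and price format survived.
old_page = "<tr><td>Price:</td><td><b>$12.50</b></td></tr>"
new_page = "<div><span>Price:</span> <em>$13.75</em></div>"
old_wrapper = ("<b>", "</b>")

span = locate_by_features(new_page, "Price:", r"\$\d+\.\d{2}")
new_wrapper = repair_wrapper(new_page, span)
print(extract(new_page, old_wrapper), extract(new_page, new_wrapper))
```

In this sketch the old wrapper fails on the changed page because the `<b>` tag is gone, while the preserved label and price format are enough to find the value again and derive new delimiters from its surroundings.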

Author information

Authors and Affiliations

College of Software, University of Hunan, 410082, Changsha, China
Shunxian Zhou & Yaping Lin
College of Computer and Communication, University of Hunan, 410082, Changsha, China
Jingpu Wang & Xiaolin Yang

Editor information

Editors and Affiliations

Polish Academy of Sciences, Institute of Computer Science, ul. Ordona 21, 01-237, Warszawa, Poland
Mieczysław A. Kłopotek, Sławomir T. Wierzchoń & Krzysztof Trojanowski

About this paper

Cite this paper

Zhou, S., Lin, Y., Wang, J., Yang, X. (2006). Wrapper Maintenance for Web-Data Extraction Based on Pages Features. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds) Intelligent Information Processing and Web Mining. Advances in Soft Computing, vol 35. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33521-8_31

DOI: https://doi.org/10.1007/3-540-33521-8_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33520-7
Online ISBN: 978-3-540-33521-4
eBook Packages: Engineering, Engineering (R0)
