Abstract
Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. There are two main issues relevant to Web-data extraction, namely wrapper generation and wrapper maintenance. In this paper, we propose a novel approach to automatic wrapper maintenance. It is based on the observation that despite various page changes, many important features of the pages are preserved, such as text pattern features, annotations, and hyperlinks. Our approach uses these preserved features to identify the locations of the desired values in the changed pages, and repairs wrappers correspondingly. Experiments over several real-world Web sites show that the proposed automatic approach can effectively maintain wrappers to extract desired data with high accuracy.
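The idea of relocating a value through features that survive a page change can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not spell out. It only assumes that a field's text pattern (here a regular expression) and its annotation (the label text immediately preceding the value) remain stable while the surrounding markup changes. The names `FieldFeatures`, `strip_tags`, and `locate_value`, and the scoring rule, are hypothetical and chosen for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldFeatures:
    """Hypothetical record of features assumed to survive a page change."""
    pattern: str      # text pattern feature, e.g. a regex for prices
    annotation: str   # label text expected directly before the value


def strip_tags(html: str) -> str:
    """Replace tags with spaces and collapse whitespace, so that
    changes in markup do not affect feature matching."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def locate_value(html: str, feat: FieldFeatures) -> Optional[str]:
    """Return the candidate that matches the text pattern, preferring
    one directly preceded by the annotation (assumed scoring rule)."""
    text = strip_tags(html)
    best = None
    for m in re.finditer(feat.pattern, text):
        prefix = text[:m.start()].rstrip().lower()
        # 1 point for matching the pattern, 1 more for the annotation.
        score = 1 + (1 if prefix.endswith(feat.annotation.lower()) else 0)
        if best is None or score > best[0]:
            best = (score, m.group(0))
    return best[1] if best else None


# Usage: the markup differs from the original table layout,
# but the label "Price:" and the price format are preserved.
feat = FieldFeatures(pattern=r"\$\d+\.\d{2}", annotation="Price:")
changed = ('<div>Shipping: $4.00</div>'
           '<div class="p"><span>Price:</span> <em>$21.50</em></div>')
print(locate_value(changed, feat))
```

Once the value is located in the changed page, a repair step would derive new extraction rules from its position. That step depends on the wrapper language in use and is not sketched here.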
© 2006 Springer
About this paper
Cite this paper
Zhou, S., Lin, Y., Wang, J., Yang, X. (2006). Wrapper Maintenance for Web-Data Extraction Based on Pages Features. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds) Intelligent Information Processing and Web Mining. Advances in Soft Computing, vol 35. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33521-8_31
DOI: https://doi.org/10.1007/3-540-33521-8_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33520-7
Online ISBN: 978-3-540-33521-4
eBook Packages: Engineering, Engineering (R0)