Abstract
The web is an increasingly valuable source of information, and organizations such as the Internet Archive (www.archive.org) archive portions of it for various purposes. A new mission of the French National Library (BnF) is the “dépôt légal” (legal deposit) of the French web. We describe preliminary work on this topic conducted by BnF and INRIA. In particular, we consider the acquisition of the web archive. The main issues are defining the perimeter of the French web and choosing which pages to read once and which to read several times (to take changes into account). Keeping several copies of the same page raises versioning issues, which we briefly consider. Finally, we mention some first experiments.
This was a decision of King François I.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abiteboul, S., Cobéna, G., Masanes, J., Sedrati, G. (2002). A First Experience in Archiving the French Web. In: Agosti, M., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2002. Lecture Notes in Computer Science, vol 2458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45747-X_1
DOI: https://doi.org/10.1007/3-540-45747-X_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44178-6
Online ISBN: 978-3-540-45747-3
eBook Packages: Springer Book Archive