Abstract
The creation time of documents is an important kind of information in temporal information retrieval, especially for document clustering, timeline construction and search engine improvements. Considering the manner in which content on the Web is created, updated and deleted, the common assumption that each document has only one creation time is not suitable for Web documents. In this paper, we investigate to what extent this assumption is wrong. We introduce two methods to timestamp individual parts (sub-documents) of Web documents and analyze in detail the creation and update dynamics of three classes of Web documents.
Notes
- 1.
- 2. This time range was chosen due to our experimental data, cf. Sect. 4.
- 3.
- 4.
- 5. CRF++: https://taku910.github.io/crfpp/.
- 6.
- 7. Specifically, we sampled from Disk1 of the ClueWeb12 corpus.
- 8. We mean here all versions available on IA, not just those with changed content.
- 9.
- 10. \(max\_features\) is 3 and 6, C-value is \(9 \times 10^{-6}\).
- 11. McNemar’s test was employed for statistical significance testing, with \(p < 0.01\).
- 12. \(max\_features\) is 5 and 13, C-value is \(9 \times 10^{-5}\).
- 13. \(max\_features\) is 7 and 11, C-value is \(1 \times 10^{-6}\).
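The notes name McNemar’s test with a threshold of \(p < 0.01\) but do not say which variant was used. As a minimal sketch, assuming the exact two-sided binomial form, the p-value for two classifiers evaluated on the same items could be computed as follows. The function name `mcnemar_exact` and the counts `b` and `c` are hypothetical and not taken from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (illustrative sketch, not the authors' code).

    b: items classifier A labels correctly and classifier B labels wrongly.
    c: items classifier A labels wrongly and classifier B labels correctly.
    Items on which both classifiers agree do not enter the test.
    """
    n = b + c
    if n == 0:
        # No disagreements: there is no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Under the null hypothesis each disagreement falls on either side
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test and cap at 1.
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical disagreement counts, chosen only for illustration.
    p_value = mcnemar_exact(b=3, c=25)
    print(f"p = {p_value:.6f}; significant at 0.01: {p_value < 0.01}")
```

Only the discordant pairs matter here, so the test stays meaningful when both classifiers get most items right.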
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhao, Y., Hauff, C. (2016). Sub-document Timestamping: A Study on the Content Creation Dynamics of Web Documents. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_16
DOI: https://doi.org/10.1007/978-3-319-43997-6_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43996-9
Online ISBN: 978-3-319-43997-6
eBook Packages: Computer Science, Computer Science (R0)