Abstract
We exploit the Feller-Pareto characterization of the classical Pareto distribution to derive a law relating the probability of a given term frequency in a document to the document's length. A similar law was derived by Mandelbrot. We use the Paretian distribution to obtain a term frequency normalization that substitutes for the actual term frequency in the probabilistic models of Information Retrieval recently introduced at TREC-10. Preliminary results show that the single parameter of the framework can be eliminated in favour of the term frequency normalization derived from the Paretian law.
References
Gianni Amati, Claudio Carpineto, and Giovanni Romano. FUB at TREC 10 web track: a probabilistic framework for topic relevance term weighting. In Proceedings of the 10th Text Retrieval Conference (TREC-10), Gaithersburg, MD, 2001.
Gianni Amati and Cornelis Joost van Rijsbergen. Probabilistic models of information retrieval based on measuring divergence from randomness. Submitted to TOIS, 2001.
Barry C. Arnold. Pareto distributions. International Co-operative Publishing House, Fairland, Md., 1983.
C. Carpineto, R. De Mori, G. Romano, and B. Bigi. An information theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1–27, 2001.
D.G. Champernowne. The theory of income distribution. Econometrica, 5:379–381, 1937.
Mark E. Crovella, Murad S. Taqqu, and Azer Bestavros. Heavy-tailed probability distributions in the world wide web. In R.J. Adler, R.E. Feldman, and M.S. Taqqu, editors, A practical guide to heavy tails. Birkhäuser, Boston, Basel and Berlin, 1998.
J.B. Estoup. Gammes Stenographiques. 4th edition, Paris, 1916.
William Feller. An introduction to probability theory and its applications. Vol. I. John Wiley & Sons Inc., New York, third edition, 1968.
William Feller. An Introduction to Probability Theory and Its Applications, volume II. John Wiley & Sons, New York, second edition, 1971.
D. Hawking. Overview of the TREC-9 web track. In Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, 2001.
G. Herdan. Quantitative Linguistics. Butterworths, 1964.
Benoit Mandelbrot. On the theory of word frequencies and on related markovian models of discourse. In Proceedings of Symposia in Applied Mathematics. Vol. XII: Structure of language and its mathematical aspects, pages 190–219. American Mathematical Society, Providence, R.I., 1961. Roman Jakobson, editor.
H. S. Sichel. Parameter estimation for a word frequency distribution based on occupancy theory. Comm. Statist. A—Theory Methods, 15(3):935–949, 1986.
H. S. Sichel. Word frequency distributions and type-token characteristics. Math. Sci., 11(1):45–72, 1986.
Herbert A. Simon. On a class of skew distribution functions. Biometrika, 42:425–440, 1955.
J.C. Willis. Age and area. Cambridge University Press, London and New York, 1922.
G.K. Zipf. Human behavior and the principle of least effort. Addison-Wesley Press, Reading, Massachusetts, 1949.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Amati, G., van Rijsbergen, C.J. (2002). Term Frequency Normalization via Pareto Distributions. In: Crestani, F., Girolami, M., van Rijsbergen, C.J. (eds) Advances in Information Retrieval. ECIR 2002. Lecture Notes in Computer Science, vol 2291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45886-7_13
DOI: https://doi.org/10.1007/3-540-45886-7_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43343-9
Online ISBN: 978-3-540-45886-9
eBook Packages: Springer Book Archive