Probability Density Function for Clustering Validation

Figuera, Pau; Cuzzocrea, Alfredo; García Bringas, Pablo

doi:10.1007/978-3-031-40725-3_12


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14001)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems


Abstract

Choosing the number of clusters is a central problem in Machine Learning. Validation methods provide solutions, with the drawback that they do not allow inference. In this manuscript, we derive a distribution for the number of clusters for clustering validation. The starting point of our approach is the transformation of the data to the probabilistic space. Then, the dependence of the non-negative factorization on the dimensionality of the spanned space provides a sequence of traces as the dimensionality varies. Its limit is a gamma distribution. This result allows a non-excluding discussion when interpreting probabilities as credibility levels, and it opens the door to inference for clustering validation.
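The abstract states only that the limiting distribution is a gamma; this preview gives neither its parameters nor the trace sequence. The following is a minimal sketch, not the authors' procedure, of how a gamma density might be read as credibility levels over candidate numbers of clusters. It assumes a shape-rate parameterisation, a method-of-moments fit, and that the density is evaluated directly at each candidate count. The function names and the input values are hypothetical and chosen purely for illustration.

```python
import math


def gamma_pdf(x, shape, rate):
    """Gamma density (shape-rate form) at x; zero for x <= 0."""
    if x <= 0:
        return 0.0
    # Work in log space to avoid overflow for large shape values.
    log_pdf = (
        shape * math.log(rate)
        - math.lgamma(shape)
        + (shape - 1.0) * math.log(x)
        - rate * x
    )
    return math.exp(log_pdf)


def fit_gamma_moments(values):
    """Method-of-moments estimates (shape, rate) from positive samples."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean ** 2 / var, mean / var


def credibility(candidates, shape, rate):
    """Normalise gamma densities over the candidates so they sum to 1."""
    densities = [gamma_pdf(k, shape, rate) for k in candidates]
    total = sum(densities)
    return {k: d / total for k, d in zip(candidates, densities)}


# Hypothetical values, for illustration only (not taken from the paper).
samples = [2.0, 3.0, 3.0, 4.0, 5.0, 4.0, 3.0]
shape, rate = fit_gamma_moments(samples)
levels = credibility(range(1, 11), shape, rate)
```

Because every candidate receives a non-zero level, no number of clusters is ruled out outright, which matches the "non-excluding" reading mentioned in the abstract.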



Author information

Authors and Affiliations

D4K Group, University of Deusto, Bilbao, Spain
Pau Figuera & Pablo García Bringas
iDEA Lab, University of Calabria, Rende, Italy
Alfredo Cuzzocrea


Corresponding author

Correspondence to Pau Figuera.

Editor information

Editors and Affiliations

University of Deusto, Bilbao, Spain
Pablo García Bringas
University of Leon, León, Spain
Hilde Pérez García
University of La Rioja, Logroño, La Rioja, Spain
Francisco Javier Martínez de Pisón
Pablo de Olavide University, Seville, Spain
Francisco Martínez Álvarez
Pablo de Olavide University, Seville, Spain
Alicia Troncoso Lora
University of Burgos, Burgos, Spain
Álvaro Herrero
University of A Coruña, Ferrol - Coruña, Spain
José Luis Calvo Rolle
University of A Coruña, Ferrol - Coruña, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
Emilio Corchado


About this paper

Cite this paper

Figuera, P., Cuzzocrea, A., García Bringas, P. (2023). Probability Density Function for Clustering Validation. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2023. Lecture Notes in Computer Science, vol 14001. Springer, Cham. https://doi.org/10.1007/978-3-031-40725-3_12


DOI: https://doi.org/10.1007/978-3-031-40725-3_12
Published: 29 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40724-6
Online ISBN: 978-3-031-40725-3
eBook Packages: Computer Science, Computer Science (R0)
