Abstract
Choosing the number of clusters is a central problem in machine learning. Validation methods provide solutions, with the drawback that they do not support statistical inference. In this manuscript, we derive a distribution for the number of clusters for use in clustering validation. The starting point of our approach is the transformation of the data to the probabilistic space. Then, the dependence of the non-negative factorization on the dimensionality of the spanned space yields a sequence of traces as the dimensionality varies. The limit of this sequence is a gamma distribution. This result allows a non-excluding discussion in which probabilities are interpreted as credibility levels, and it opens the door to inference for clustering validation.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Figuera, P., Cuzzocrea, A., García Bringas, P. (2023). Probability Density Function for Clustering Validation. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2023. Lecture Notes in Computer Science, vol 14001. Springer, Cham. https://doi.org/10.1007/978-3-031-40725-3_12
DOI: https://doi.org/10.1007/978-3-031-40725-3_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40724-6
Online ISBN: 978-3-031-40725-3
eBook Packages: Computer Science (R0)