Abstract
For successful clustering, an algorithm needs to find the boundaries between clusters. While this is comparatively easy when clusters are compact and non-overlapping, and the boundaries are therefore clearly defined, features in which the clusters blend into each other hinder clustering methods from correctly estimating these boundaries. We therefore aim to extract features that show clear cluster boundaries and thus enhance the cluster structure in the data. Our novel technique creates a condensed version of the data set that retains the structure important for clustering while discarding the noise. We demonstrate that this transformed data set is substantially easier to cluster, not only for k-means but also for various other algorithms. Furthermore, we introduce a deterministic initialisation strategy for k-means based on these structure-rich features.
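To make the idea concrete, below is a minimal sketch of dip-based feature weighting in the spirit of Hartigan's dip test [8] and the DipTransformation line of work [19, 20]: features that carry clear multimodal (i.e. cluster) structure are amplified before running k-means, so they dominate the distance computations. This is an illustrative reconstruction, not the authors' algorithm; the third-party `diptest` package and the simple per-feature rescaling are assumptions of this sketch.

```python
# Sketch: weight each feature by its dip statistic (higher dip = more
# multimodal, i.e. richer cluster structure) before clustering.
# NOT the paper's exact method; `diptest` (pip install diptest) is an
# assumed dependency implementing Hartigan's dip test.
import numpy as np
from diptest import diptest
from sklearn.cluster import KMeans

def dip_weighted_features(X):
    """Scale every feature by its dip statistic, yielding a 'condensed'
    representation in which structure-rich features are emphasised."""
    dips = np.array([diptest(X[:, j])[0] for j in range(X.shape[1])])
    return X * dips, dips

rng = np.random.default_rng(0)
# Feature 0: clear two-cluster structure; feature 1: pure unimodal noise.
X = np.column_stack([
    np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)]),
    rng.normal(0, 1, 400),
])
X_t, dips = dip_weighted_features(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_t)
print("dip per feature:", dips.round(3))  # feature 0 gets the larger weight
```

In this toy example the bimodal feature receives a markedly larger dip value than the noise feature, which mimics the abstract's goal of letting the structure important for clustering dominate while suppressing noise.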
Notes
1. We follow the argument given in [16] regarding the explicit form of the derivative.
References
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SODA (2007)
Celebi, M., Kingravi, H., Vela, P.: A comparative study of efficient initialisation methods for the K-Means clustering algorithm. Expert Syst. Appl. 40(1), 200–210 (2013)
Chronis, P., Athanasiou, S., Skiadopoulos, S.: Automatic clustering by detecting significant density dips in multiple dimensions. In: ICDM (2019)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–22 (1977)
Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD (1996)
Goebl, S., He, X., Plant, C., Böhm, C.: Finding the optimal subspace for clustering. In: ICDM (2014)
Guo, X., Gao, L., Liu, X., Yin, J.: Improved deep embedded clustering with local structure preservation. In: IJCAI (2017)
Hartigan, J.A., Hartigan, P.M.: The dip test of unimodality. Ann. Stat. 13(1), 70–84 (1985)
Jing, L., Ng, M.K., Huang, J.Z.: An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Trans. Knowl. Data Eng. (2007)
Kalogeratos, A., Likas, A.: Dip-means: an incremental clustering method for estimating the number of clusters. In: NIPS (2012)
Krause, A., Liebscher, V.: Multimodal projection pursuit using the dip statistic. Preprint-Reihe Mathematik (2005)
Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data (2009)
Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Berkeley Symposium on Math. Stat. and Prob. (1967)
McInnes, L., Healy, J., Melville, J.: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (2018)
Maurus, S., Plant, C.: Skinny-dip: clustering in a sea of noise. In: KDD (2016)
Mautz, D., Ye, W., Plant, C., Böhm, C.: Towards an optimal subspace for k-means. In: KDD (2017)
Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: NIPS (2002)
Schelling, B., Plant, C.: DipTransformation: enhancing the structure of a dataset and thereby improving clustering. In: ICDM (2018)
Schelling, B., Plant, C.: Dataset-transformation: improving clustering by enhancing the structure with DipScaling and DipTransformation. Knowl. Inf. Syst. (2019)
Sibson, R.: SLINK: an optimally efficient algorithm for the single-link cluster method. Comput. J. 16(1), 30–34 (1973)
Siffer, A., Fouque, P.A., Termier, A., Largouet, C.: Are your data gathered? In: KDD (2018)
Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010)
Wu, H., Gu, X.: Max-pooling dropout for regularization of convolutional neural networks. In: ICONIP (2015)
Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In: ICML (2016)
Yang, B., Fu, X., Sidiropoulos, N.: Learning from hidden traits: joint factor analysis and latent clustering. IEEE Trans. Signal Process. (2017)
Yang, B., Fu, X., Sidiropoulos, N., Hong, M.: Towards K-means-friendly spaces: simultaneous deep learning and clustering. In: ICML (2017)
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Schelling, B., Bauer, L.G.M., Behzadi, S., Plant, C. (2021). Utilizing Structure-Rich Features to Improve Clustering. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12457. Springer, Cham. https://doi.org/10.1007/978-3-030-67658-2_6
DOI: https://doi.org/10.1007/978-3-030-67658-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67657-5
Online ISBN: 978-3-030-67658-2
eBook Packages: Computer Science, Computer Science (R0)