Clustering Very Large Dissimilarity Data Sets

Hammer, Barbara; Hasenfuss, Alexander

doi:10.1007/978-3-642-12159-3_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5998)

Included in the following conference series:

IAPR Workshop on Artificial Neural Networks in Pattern Recognition

Abstract

Clustering and visualization are key issues in computer-supported data inspection, and a variety of promising tools exist for these tasks, such as the self-organizing map (SOM) and its variations. Real-life data, however, pose severe problems for standard data inspection. On the one hand, data are often represented by complex non-vectorial objects, so standard methods for finite-dimensional vectors in Euclidean space cannot be applied. On the other hand, very large data sets have to be dealt with: the data neither fit into main memory, nor is more than one pass over the data affordable, so standard methods simply cannot be applied due to the sheer amount of data. We present two recent extensions of topographic mappings: relational clustering, which can deal with general proximity data given by pairwise distances, and patch processing, which can process streaming data of arbitrary size in patches. Combined, they yield an efficient linear-time data inspection method for general dissimilarity data structures. We present the theoretical background as well as applications to text and multimedia processing based on the generalized compression distance.
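
To make the notion of dissimilarity data concrete, the following minimal sketch builds a pairwise dissimilarity matrix from raw byte strings using the normalized compression distance of Cilibrasi and Vitanyi. It is an illustration only: whether the compression distance used in the chapter takes exactly this form is an assumption, as are the choice of zlib as the compressor and the hypothetical helper names compressed_size, ncd, and dissimilarity_matrix.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def dissimilarity_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise dissimilarities of all items.

    The diagonal is set to zero and the matrix is symmetrized by computing
    only the upper triangle; both are simplifying assumptions, since a real
    compressor gives neither property exactly.
    """
    n = len(items)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = ncd(items[i], items[j])
    return d


if __name__ == "__main__":
    texts = [
        b"the quick brown fox jumps over the lazy dog " * 50,
        b"the quick brown fox jumps over the lazy cat " * 50,
        b"clustering very large dissimilarity data sets " * 50,
    ]
    for row in dissimilarity_matrix(texts):
        print(" ".join(f"{value:.3f}" for value in row))
```

The full matrix has one entry per pair of objects, so its size grows quadratically with the number of objects. That cost is the reason a single-pass, patch-wise scheme is attractive for very large data sets.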

Author information

Authors and Affiliations

CITEC, University of Bielefeld, Germany
Barbara Hammer
Department of Computer Science, Clausthal University of Technology, Germany
Alexander Hasenfuss

Editor information

Editors and Affiliations

Department of Neural Information Processing, Oberer Eselsberg, University of Ulm, 89069, Ulm, Germany
Friedhelm Schwenker
Center for Informatics Science, Nile University, 12677, Giza, Egypt
Neamat El Gayar

About this paper

Cite this paper

Hammer, B., Hasenfuss, A. (2010). Clustering Very Large Dissimilarity Data Sets. In: Schwenker, F., El Gayar, N. (eds) Artificial Neural Networks in Pattern Recognition. ANNPR 2010. Lecture Notes in Computer Science, vol 5998. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12159-3_24

DOI: https://doi.org/10.1007/978-3-642-12159-3_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12158-6
Online ISBN: 978-3-642-12159-3
eBook Packages: Computer Science (R0)
