Abstract
Clustering and visualization are key tasks in computer-supported data inspection, and a variety of promising tools exist for them, such as the self-organizing map (SOM) and its variants. Real-life data, however, pose severe problems for standard data inspection. On the one hand, data are often represented by complex non-vectorial objects, so standard methods for finite-dimensional vectors in Euclidean space cannot be applied. On the other hand, very large data sets have to be dealt with: the data do not fit into main memory, and more than one pass over the data is not affordable, so standard methods simply cannot be applied due to the sheer amount of data. We present two recent extensions of topographic mappings: relational clustering, which can deal with general proximity data given by pairwise distances, and patch processing, which can process streaming data of arbitrary size in patches. Combined, they yield an efficient linear-time data inspection method for general dissimilarity data structures. We present the theoretical background as well as applications to text and multimedia processing based on the generalized compression distance.
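To illustrate what "proximity data given by pairwise distances" can look like for non-vectorial objects, the following minimal sketch builds a dissimilarity matrix from the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C denotes compressed length. It is an assumption-laden illustration, not the paper's method: the abstract refers to a generalized compression distance whose exact form is not given here, and the choice of zlib as compressor, the symmetrization step, and the function names are all hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes; zlib is an assumed stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Standard normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def dissimilarity_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise dissimilarities, symmetrized, with a zero diagonal by convention."""
    n = len(items)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # C(xy) and C(yx) can differ slightly, so average both orders.
            value = 0.5 * (ncd(items[i], items[j]) + ncd(items[j], items[i]))
            d[i][j] = d[j][i] = value
    return d


if __name__ == "__main__":
    docs = [
        b"the quick brown fox jumps over the lazy dog " * 20,
        b"the quick brown fox jumps over the lazy cat " * 20,
        bytes(range(256)),
    ]
    for row in dissimilarity_matrix(docs):
        print(["%.3f" % v for v in row])
```

Computing the full matrix is quadratic in the number of objects, which is exactly the cost that processing the data in patches is meant to avoid for very large data sets.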
Keywords
- Dissimilarity Measure
- Symmetric Bilinear Form
- Dissimilarity Matrix
- Data Inspection
- Normalize Compression Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hammer, B., Hasenfuss, A. (2010). Clustering Very Large Dissimilarity Data Sets. In: Schwenker, F., El Gayar, N. (eds) Artificial Neural Networks in Pattern Recognition. ANNPR 2010. Lecture Notes in Computer Science, vol 5998. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12159-3_24
DOI: https://doi.org/10.1007/978-3-642-12159-3_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12158-6
Online ISBN: 978-3-642-12159-3
eBook Packages: Computer Science, Computer Science (R0)