Abstract
Given a massive set of records, a similarity join finds all pairs of records whose similarity score exceeds a threshold. In this paper, we address the problem of scaling up similarity joins for general metric distance functions using MapReduce. First, we propose a novel index structure, the Similarity Join Tree (SJT), which partitions data according to the underlying data distribution and assigns similar records to the same group. Unlike existing approaches, SJT prunes a large number of comparisons within reduce tasks by reusing the by-product results generated while partitioning the data. Then, to avoid straggler reduce tasks, we design a graph partitioning algorithm that extends the well-known Fiduccia–Mattheyses algorithm and ensures load balancing while minimizing communication cost and redundancy across all reduce tasks. Experimental results on real data sets show that our approach is more effective and scalable than state-of-the-art algorithms.
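As a rough illustration of the problem setting, the sketch below runs a threshold join in a metric space on a single machine. The abstract does not state the pruning rule used by SJT, so the sketch assumes a standard triangle-inequality filter against one pivot: if |d(a, p) - d(b, p)| > eps, then d(a, b) > eps and the pair can be skipped. The join condition is written as distance at most eps, and the names `naive_join`, `pivot_join`, `eps`, and `pivot` are hypothetical; this is neither the SJT structure nor the MapReduce algorithm of the paper.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance, one example of a metric distance function."""
    return math.dist(a, b)


def naive_join(records, eps, dist=euclidean):
    """Baseline: compare every pair and keep those within distance eps."""
    return {
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if dist(records[i], records[j]) <= eps
    }


def pivot_join(records, eps, pivot, dist=euclidean):
    """Same join, with a triangle-inequality filter (an assumed pruning rule).

    Distances to the pivot are computed once. A pair is skipped without a
    distance computation when |d(a, pivot) - d(b, pivot)| > eps, because the
    triangle inequality then guarantees d(a, b) > eps.
    """
    to_pivot = [dist(r, pivot) for r in records]
    result = set()
    computed = 0  # number of pairwise distances actually evaluated
    for i, j in combinations(range(len(records)), 2):
        if abs(to_pivot[i] - to_pivot[j]) > eps:
            continue  # pruned by the lower bound
        computed += 1
        if dist(records[i], records[j]) <= eps:
            result.add((i, j))
    return result, computed


# Toy data: two tight clusters and one outlier (made-up values).
records = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.1), (10.0, 0.0)]
pairs, computed = pivot_join(records, eps=1.0, pivot=(0.0, 0.0))
print(sorted(pairs), computed)
```

The filter never discards a qualifying pair, so the result matches the naive join; how many comparisons it saves depends on the pivot and on the data distribution.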















Notes
For the added virtual nodes in SJT\(_C\), the weight is set to 1.
Acknowledgments
This work is supported by the NSFC under Grant 61173160 and by the Scientific Research Program of the Higher Education Institution of XinJiang (XJEDU2014S087).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Liu, W., Shen, Y. & Wang, P. An efficient MapReduce algorithm for similarity join in metric spaces. J Supercomput 72, 1179–1200 (2016). https://doi.org/10.1007/s11227-016-1651-9
DOI: https://doi.org/10.1007/s11227-016-1651-9