Abstract
Mining patterns from multi-relational data is a problem attracting increasing interest within the data mining community. Traditional data mining approaches are typically developed for single-table databases and are not directly applicable to multi-relational data. Nevertheless, multi-relational data is a more truthful, and therefore often also a more powerful, representation of reality. Mining patterns of a suitably expressive syntax directly from this representation is thus a research problem of great importance. In this paper we introduce a novel approach to mining patterns in multi-relational data. We propose a new syntax for multi-relational patterns as complete connected subsets of database entities. We show how this pattern syntax is generally applicable to multi-relational data, while it reduces to well-known tiles (Geerts et al. 2004) when the data is a simple binary or attribute-value table. We propose RMiner, a simple yet practically efficient divide and conquer algorithm to mine such patterns, which is an instantiation of an algorithmic framework for efficiently enumerating all fixed points of a suitable closure operator (Boley et al. 2010). We show how the interestingness of patterns of the proposed syntax can conveniently be quantified using a general framework for quantifying subjective interestingness of patterns (De Bie 2011b). Finally, we illustrate the usefulness and the general applicability of our approach by discussing results on real-world and synthetic databases.

Notes
In contrast to some traditional fixpoint enumeration algorithms, such as those used in formal concept analysis, this divide and conquer approach assumes neither an underlying complete lattice nor that the fixpoint set is closed under intersection. This is important because the set system of complete connected subsets (CCSs) is not necessarily closed under intersection (due to connectivity), and two maximal CCSs (MCCSs) cannot be joined to a common supremum (due to completeness).
Please note that by entities and entity types we refer here to our own notions of these terms. The same notions are called objects and entities, respectively, in Nijssen et al. (2011).
Note that the quadratic space complexity of RMiner results from multiplying a linear space complexity by the maximal search tree depth, which, as we show in Sect. 7.3, is a small constant in practice. Also, as discussed in Sect. 3.5, the time delay of RMiner depends on the density of the data set and can be reduced by particular implementation choices. Thus, even though the theoretical complexities of the algorithm of Makino and Uno (2004) and of RMiner are comparable, RMiner probably scales better in practice.
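To make the first note more concrete, the following is a minimal sketch of divide and conquer fixpoint enumeration in the simplest setting the abstract mentions, a single binary table. It is not RMiner itself: it only illustrates the general scheme in the spirit of Boley et al. (2010), using the standard itemset closure (the items shared by all rows that contain a given itemset). Each closed itemset together with its supporting rows corresponds to a tile. All function names (`support`, `closure`, `list_closed`, `closed_itemsets`) and the toy table are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, Iterator, List

Itemset = FrozenSet[str]


def support(rows: List[Itemset], items: Itemset) -> List[int]:
    """Indices of the rows that contain every item in `items`."""
    return [i for i, row in enumerate(rows) if items <= row]


def closure(rows: List[Itemset], items: Itemset) -> Itemset:
    """Items shared by all rows containing `items` (assumes support >= 1)."""
    covering = [row for row in rows if items <= row]
    return frozenset.intersection(*covering)


def list_closed(rows: List[Itemset], closed: Itemset,
                banned: Itemset) -> Iterator[Itemset]:
    """Yield every closed superset of `closed` that avoids `banned`."""
    universe = frozenset().union(*rows)
    for item in sorted(universe - closed - banned):
        if not support(rows, closed | {item}):
            continue  # infeasible extension: no row contains it
        bigger = closure(rows, closed | {item})
        # Branch 1: closed sets that contain `item`.
        if not (bigger & banned):
            yield bigger
            yield from list_closed(rows, bigger, banned)
        # Branch 2: closed sets that do not contain `item`.
        yield from list_closed(rows, closed, banned | {item})
        return


def closed_itemsets(table: Iterable[Iterable[str]]) -> List[Itemset]:
    """All closed itemsets with non-empty support, each listed once."""
    rows = [frozenset(row) for row in table]
    if not rows:
        return []
    root = closure(rows, frozenset())
    return [root] + list(list_closed(rows, root, frozenset()))


# Hypothetical toy table: three rows over the items a, b, c.
table = [{"a", "b"}, {"a", "b", "c"}, {"c"}]
print([sorted(s) for s in closed_itemsets(table)])
# prints [[], ['a', 'b'], ['a', 'b', 'c'], ['c']]
```

The two recursive branches split the search space on a single item, and the `banned` set prevents the same fixpoint from being reported twice. Nothing in the recursion relies on the closed sets forming a lattice or being closed under intersection, which is the property the note emphasises.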
References
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases (VLDB), pp 487–499
Angles R, Gutierrez C (2008) Survey of graph database models. ACM Comput Surv 40(1):1:1–1:39
Birkhoff G (1967) Lattice theory. American Mathematical Society, Providence
Boley M (2011) The efficient discovery of interesting closed pattern collections. PhD thesis, University of Bonn, Bonn
Boley M, Horvath T, Poigné A, Wrobel S (2010) Listing closed sets of strongly accessible set systems with applications to data mining. Theor Comput Sci 411(3):691–700
Bron C, Kerbosch J (1973) Algorithm 457: finding all cliques of an undirected graph. Commun ACM 16(9):575–577
Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu T (2005) Mafia: a maximal frequent itemset algorithm. IEEE Trans Knowl Data Eng 17(11):1490–1504
Calders T, Goethals B (2007) Non-derivable itemset mining. Data Min Knowl Discov 14(1):171–206
Cerf L, Besson J, Robardet C, Boulicaut JF (2009) Closed patterns meet n-ary relations. ACM Trans Knowl Discov Data 3(1):3:1–3:36
Cover TM, Thomas JA (2005) Elements of information theory. Wiley, Hoboken
De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 564–572
De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446
De Bie T, Kontonasios KN, Spyropoulou E (2010) A framework for mining interesting pattern sets. In: SIGKDD explorations, pp 92–100
De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the SIAM international conference on data mining (SDM), pp 237–248
Dehaspe L, Toivonen H (1999) Discovery of frequent datalog patterns. Data Min Knowl Discov 3:7–36
Elmasri R, Navathe SB (2006) Fundamentals of database systems. Addison Wesley, Boston
Garriga GC, Khardon R, De Raedt L (2007) On mining closed sets in multi-relational data. In: Proceedings of the 20th international joint conference on artificial intelligence (IJCAI), pp 804–809
Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of discovery science, pp 278–289
Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. In: ACM computing surveys, vol 38. ACM, New York
Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3):14
Goethals B, Le Page W (2008) Mining association rules of simple conjunctive queries. In: Proceedings of the SIAM international conference on data mining (SDM), Atlanta
Goethals B, Le Page W, Mampaey M (2010) Mining interesting sets and rules in relational databases. In: Proceedings of the ACM symposium on applied computing (SAC), pp 997–1001
Gupta R, Fang G, Field B, Steinbach M, Kumar V (2008) Quantitative evaluation of approximate frequent pattern mining algorithms. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 301–309
Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD). ACM, New York, pp 379–388
Jäschke R, Hotho A, Schmitz C, Ganter B, Stumme G (2008) Discovering shared conceptualizations in folksonomies. Web Semant 6(1):38–53
Jen TY, Laurent D, Spyratos N (2010) Computing supports of conjunctive queries on relational tables with functional dependencies. Fundam Inf 99(3):263–292
Ji M, Han J, Danilevsky M (2011) Ranking-based classification of heterogeneous information networks. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 1298–1306
Ji M, Sun Y, Danilevsky M, Han J, Gao J (2010) Graph regularized transductive classification on heterogeneous information networks. In: ECML/PKDD (1), pp 570–586
Ji L, Tan KL, Tung AKH (2006) Mining frequent closed cubes in 3D datasets. In: Proceedings of the international conference on very large data bases (VLDB), pp 811–822
Kontonasios K, Spyropoulou E, De Bie T (2012) Knowledge discovery interestingness measures based on unexpectedness. In: Wiley interdisciplinary reviews: data mining and knowledge discovery, pp 386–399
Koopman A, Siebes A (2008) Discovering relational item sets efficiently. In: Proceedings of the SIAM conference on data mining (SDM), pp 108–119
Koopman A, Siebes A (2009) Characteristic relational patterns. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 437–446
Korte B, Lovász L (1985) Relations between subclasses of greedoids. Math Methods Oper Res 29:249–267
Kuramochi M, Karypis G (2001) Frequent subgraph discovery. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 313–320
Lawler EL, Lenstra JK, Kan AHGR (1980) Generating all maximal independent sets: NP-hardness and polynomial-time algorithms. SIAM J Comput 9(3):558–565
Makino K, Uno T (2004) New algorithms for enumerating all maximal cliques. In: Scandinavian workshop on algorithm theory (SWAT), pp 260–272
Maruhashi K, Guo F, Faloutsos C (2011) Multiaspectforensics: Pattern mining on large-scale heterogeneous networks with tensor analysis. In: Proceedings of the international conference on advances in social networks analysis and mining, ASONAM ’11, pp 203–210
Ng EKK, Ng K, Fu AWC, Wang K (2002) Mining association rules from stars. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 322–329
Nijssen S, Jiménez A, Guns T (2011) Constraint-based pattern mining in multi-relational databases. In: ICDM workshops, pp 1120–1127
Nijssen S, Kok J (2003) Efficient frequent query discovery in FARMER. In: Proceedings of the European conference on principles and practice of knowledge discovery in databases (PKDD), pp 350–362
Ojala M, Garriga GC, Gionis A, Mannila H (2010) Evaluating query result significance in databases via randomizations. In: Proceedings of the SIAM conference on data mining (SDM), pp 906–917
Pardalos PM, Xue J (1994) The maximum clique problem. J Glob Optim 4:301–328
Poernomo AK, Gopalkrishnan V (2009) Towards efficient mining of proportional fault-tolerant frequent itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 697–706
Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the SIAM conference on data mining (SDM), pp 393–404
Spyropoulou E, De Bie T (2011) Interesting multi-relational patterns. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 675–684
Srikant R, Agrawal R (1996) Mining quantitative association rules in large relational tables. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 1–12
Sun Y, Han J, Aggarwal CC, Chawla NV (2012a) When will it happen?: relationship prediction in heterogeneous information networks. In: Proceedings of the fifth ACM international conference on Web search and data mining, WSDM ’12, pp 663–672
Sun Y, Norick B, Han J, Yan X, Yu PS, Yu X (2012b) Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In: KDD, pp 1348–1356
Sun Y, Yu Y, Han J (2009) Ranking-based clustering of heterogeneous information networks with star network schema. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 797–806
Tang L, Wang X, Liu H (2012) Community detection via heterogeneous interaction analysis. Data Min Knowl Discov 25(1):1–33
Trabelsi C, Jelassi N, Ben Yahia S (2012) Scalable mining of frequent tri-concepts from folksonomies. In: Advances in knowledge discovery and data mining, pp 231–242
Uno T, Asai T, Uchida Y, Arimura H (2004a) An efficient algorithm for enumerating closed patterns in transaction databases. In: Discovery science, pp 16–31
Uno T, Kiyomi M, Arimura H (2004b) LCM ver. 2: efficient mining algorithms for frequent/closed/maximal itemsets. In: Proceedings of the IEEE ICDM workshop on frequent itemset mining implementations (FIMI), Brighton
Voutsadakis G (2002) Polyadic concept analysis. Order 19(3):295–304
Yahia B, Hamrouni T, Nguifo EM (2006) Frequent closed itemset based algorithms: a thorough structural and analytical survey. SIGKDD Explor Newsl 8(1):93–104
Yan X, Han J (2002) gSpan: graph-based substructure pattern mining. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 721–730
Yan X, Han J (2003) CloseGraph: mining closed frequent graph patterns. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 286–295
Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390
Zaki M, Hsiao CJ (2005) Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng 17(4):462–478
Zaki MJ, Peters M, Assent I, Seidl T (2007) Clicks: an effective algorithm for mining subspace clusters in categorical datasets. Data Knowl Eng 60(1):51–70
Zaki M, Hsiao CJ (2002) CHARM: an efficient algorithm for closed itemset mining. In: Proceedings of the SIAM international conference on data mining (SDM), pp 457–473
Zaki M, Ogihara M (1998) Theoretical foundations of association rules. In: Proceedings of the ACM SIGMOD workshop on research issues in data mining and knowledge discovery, San Diego
Acknowledgments
We are grateful to Michael Mampaey for providing the Smurfig code and data and for his support in using Smurfig, to Siegfried Nijssen for his assistance in using Farmer, and to Thomas Gärtner for discussions on this work. This work was partially funded by the PASCAL 2 Network of Excellence. Eirini Spyropoulou and Tijl De Bie are supported by EPSRC Grant EP/G056447/1. Mario Boley is partially funded by DFG (German National Research Foundation) under GA 1615/2-1.
Additional information
Responsible editor: M.J. Zaki.
Cite this article
Spyropoulou, E., De Bie, T. & Boley, M. Interesting pattern mining in multi-relational data. Data Min Knowl Disc 28, 808–849 (2014). https://doi.org/10.1007/s10618-013-0319-9
DOI: https://doi.org/10.1007/s10618-013-0319-9