Abstract
Graph classification is critically important in a wide variety of applications, e.g., drug activity prediction and toxicology analysis. Current research on graph classification focuses on single-label settings. However, in many applications, each graph can be assigned multiple labels simultaneously. Extracting good features using the multiple labels of the graphs therefore becomes an important step before graph classification. In this paper, we study the problem of multi-label feature selection for graph classification and propose a novel solution, called gMLC, to efficiently search for optimal subgraph features for graph objects with multiple labels. Unlike existing feature selection methods in vector spaces, which assume the feature set is given, we perform multi-label feature selection for graph data progressively, together with the subgraph feature mining process. We derive an evaluation criterion to estimate the dependence between subgraph features and the multiple labels of graphs. We then propose a branch-and-bound algorithm that efficiently searches for optimal subgraph features by pruning the subgraph search space using the multiple labels. Empirical studies demonstrate that our feature selection approach effectively boosts multi-label graph classification performance and improves efficiency by pruning the subgraph search space using multiple labels.
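To make the idea of a dependence-based evaluation criterion concrete, the following is a minimal sketch, not the gMLC criterion itself. The abstract does not state the exact formula, so the sketch assumes a Hilbert-Schmidt-style dependence measure with linear kernels, which reduces to the squared norm of the centered feature vector projected onto each label column (the usual normalizing constant is dropped because it does not affect the ranking). The function names `dependence_score` and `top_k_features`, the candidate names `g1` to `g4`, and the toy data are all hypothetical. The sketch also ranks a fixed candidate set exhaustively, which is exactly the setting the abstract says gMLC avoids by pruning during subgraph mining.

```python
from typing import Dict, List


def dependence_score(occurrence: List[int], labels: List[List[int]]) -> float:
    """Assumed HSIC-style score with linear kernels (constant factor omitted).

    occurrence[i] is 1 if the candidate subgraph occurs in graph i, else 0.
    labels[i][q] is 1 if graph i carries label q, else 0.
    The score is the sum over labels of the squared (unnormalized)
    covariance between the occurrence vector and that label column.
    """
    n = len(occurrence)
    mean_occurrence = sum(occurrence) / n
    centered = [f - mean_occurrence for f in occurrence]
    num_labels = len(labels[0])
    score = 0.0
    for q in range(num_labels):
        cov_q = sum(centered[i] * labels[i][q] for i in range(n))
        score += cov_q * cov_q
    return score


def top_k_features(
    candidates: Dict[str, List[int]], labels: List[List[int]], k: int
) -> List[str]:
    """Rank a given candidate set by score and keep the k best names."""
    ranked = sorted(
        candidates,
        key=lambda name: dependence_score(candidates[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Toy data (hypothetical): four graphs, two labels.
toy_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
toy_candidates = {
    "g1": [1, 1, 0, 0],  # occurs exactly in the graphs with the first label
    "g2": [1, 0, 1, 0],  # occurrence unrelated to either label
    "g3": [1, 1, 1, 1],  # occurs everywhere, so it carries no information
    "g4": [1, 1, 1, 0],  # weakly related to the labels
}
selected = top_k_features(toy_candidates, toy_labels, k=2)
```

On this toy input, a subgraph whose occurrence lines up with the label assignment scores highest, while one that occurs in every graph scores zero, which is the behavior one would want from any criterion of this kind.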
Cite this article
Kong, X., Yu, P.S. gMLC: a multi-label feature selection framework for graph classification. Knowl Inf Syst 31, 281–305 (2012). https://doi.org/10.1007/s10115-011-0407-3
DOI: https://doi.org/10.1007/s10115-011-0407-3