Abstract
In many large e-commerce organizations, multiple data sources often describe the same customers, so consolidating data from these sources is important for intelligent business decision making. In this paper, we propose a novel method that predicts the classification of data from multiple sources when no source provides class labels. We test our method on artificial and real-world datasets and show that it classifies the data accurately. From the machine learning perspective, our method removes the fundamental assumption of supervised learning that class labels must be provided, and bridges the gap between supervised and unsupervised learning.



Notes
The integer in parentheses (for example 384) means there are 384 instances in this leaf or cluster. All the trees in the paper are represented in the same format as the output of C4.5 (Quinlan, 1993).
Normally the partition trees are different from (and larger than) the ideal ones, as shown in later subsections on incomplete and noisy datasets.
This was suggested by Doug Fisher.
These two datasets have a large number of discrete attributes. Recall that CMS currently works only on discrete attributes.
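To make the multi-source setting concrete, here is a toy sketch in Python. It is an illustration of the problem only, not the paper's CMS algorithm: two sources describe the same instances with different discrete attributes and no class labels, each source partitions the instances by its attribute values, and the partitions are aligned greedily through a co-occurrence count. All variable names and the alignment rule are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Two sources describing the same 8 customers with different
# discrete attributes; neither source has class labels.
source_a = ["young", "young", "young", "old", "old", "old", "young", "old"]
source_b = ["low", "low", "low", "high", "high", "high", "low", "high"]

def partition(values):
    """Group instance indices by discrete attribute value."""
    groups = defaultdict(list)
    for i, v in enumerate(values):
        groups[v].append(i)
    return groups

pa, pb = partition(source_a), partition(source_b)

# Contingency counts: how often each group in A co-occurs with each group in B.
cont = Counter((source_a[i], source_b[i]) for i in range(len(source_a)))

# Greedy alignment: map each A-group to the B-group it overlaps most.
mapping = {ga: max(pb, key=lambda gb: cont[(ga, gb)]) for ga in pa}

# Fraction of instances whose two descriptions fall into aligned groups.
agree = sum(cont[(ga, mapping[ga])] for ga in pa) / len(source_a)
print(mapping)  # {'young': 'low', 'old': 'high'}
print(agree)    # 1.0
```

On this noise-free toy data the two partitions align perfectly; the paper's experiments concern the realistic case where partitions are incomplete or noisy and agreement is below 1.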
References
Blum, A. and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100.
Cheeseman, P. and Stutz, J. 1996. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), AAAI Press/MIT Press.
Church, K.W. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, B.C.: Association for Computational Linguistics, pp. 76–83.
de Sa, V. 1994a. Learning classification with unlabeled data. In Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector (Eds.), vol. 6, pp. 112–119.
de Sa, V. 1994b. Minimizing disagreement for self-supervised classification. In Proceedings of the 1993 Connectionist Models Summer School, M. Mozer, P. Smolensky, D. Touretzky, and A. Weigend (Eds.), pp. 300–307.
de Sa, V. and Ballard, D. 1998. Category learning through multi-modality sensing. Neural Computation, 10(5).
Fisher, D. 1987. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139–172.
Kohavi, R. and John, G. 1997. Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324.
Lu, S. and Chen, K. 1987. A machine learning approach to the automatic synthesis of mechanistic knowledge for engineering decision-making. Artificial Intelligence for Engineering Design, Analysis, and Manufacturing, 1:109–118.
Murphy, P.M. and Aha, D.W. 1992. UCI Repository of Machine Learning Databases [Machine-readable data repository]. Irvine, CA, University of California, Department of Information and Computer Science.
Nigam, K. and Ghani, R. 2000. Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 86–93.
Quinlan, J. 1993. C4.5: Programs for Machine Learning. San Mateo, CA, Morgan Kaufmann.
Raskutti, B., Ferra, H., and Kowalczyk, A. 2002. Combining clustering and co-training to enhance text classification using unlabelled data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 620–625.
Reich, Y. 1992. Ecobweb: Preliminary user's manual. Tech. rep., Department of Civil Engineering, Carnegie Mellon University.
Reich, Y. and Fenves, S. 1991. The formation and use of abstract concepts in design. In Concept Formation: Knowledge and Experience in Unsupervised Learning, D. Fisher, M. Pazzani, and P. Langley (Eds.), Morgan Kaufmann, CA.
Reich, Y. and Fenves, S. 1992. Inductive learning of synthesis knowledge. International Journal of Expert Systems: Research and Applications, 5(4):275–297.
Sinkkonen, J., Nikkilä, J., Lahti, L., and Kaski, S. 2004. Associative clustering. In Proceedings of the 15th European Conference on Machine Learning (ECML 2004), pp. 396–406.
Turney, P. 1993. Exploiting context when learning to classify. In Proceedings of ECML-93, pp. 402–407.
Wu, X. and Zhang, S. 2003. Synthesizing high-frequency rules from different data sources. IEEE Transactions on Knowledge and Data Engineering, 15(2):353–367.
Yao, Y., Chen, L., Goh, A., and Wong, A. 2002. Clustering gene data via associative clustering neural network. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP 2002), pp. 2228–2232.
Zhang, S., Wu, X., and Zhang, C. 2003. Multi-database mining. IEEE Computational Intelligence Bulletin, 2(1):5–13.
Acknowledgments
We thank Doug Fisher and Joel Martin for their extensive and insightful comments and suggestions on earlier versions of the paper. We also thank Chenghui Li for discussions and for his work with CMS. Qiang Yang acknowledges the support of Hong Kong RGC grant HKUST 6187/04E.
Cite this article
Ling, C.X., Yang, Q. Discovering Classification from Data of Multiple Sources. Data Min Knowl Disc 12, 181–201 (2006). https://doi.org/10.1007/s10618-005-0013-7