Abstract
To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and classifiers learned from the major set are used to identify noise in the minor set. The drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it becomes either physically impossible or too time-consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be either technically infeasible or forbidden by policy (for security or privacy reasons) to download data from other sites. These approaches therefore have severe limitations in conducting effective global data cleansing on large, distributed datasets.
In this paper, we propose a solution that bridges local and global analysis for noise cleansing. More specifically, the proposed approach identifies and eliminates mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets, or partition a large dataset into subsets, each of which is regarded as a local subset small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset, and use these good rules to evaluate the whole dataset. For a given instance I_k, two error count variables record the number of times it has been identified as noise by the models from all data subsets; instances with higher error counts have a higher probability of being mislabeled. Two threshold schemes, majority and non-objection, are then used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach.
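The cleansing scheme summarized above is easy to prototype. Below is a minimal sketch in Python, assuming a scikit-learn decision tree as a stand-in for the good-rule learner and collapsing the two per-instance error count variables into a single disagreement counter plus a vote counter; the function name flag_class_noise, the parameter n_subsets, and the toy data are illustrative assumptions rather than the authors' implementation.

# Minimal sketch of partition-based class-noise identification.
# Assumes X, y are NumPy arrays; a decision tree stands in for the
# paper's "good rule" learner, and names here are illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def flag_class_noise(X, y, n_subsets=5, scheme="majority", random_state=0):
    """Partition (X, y) into subsets, learn a local model on each subset,
    count how many local models disagree with each instance's label, and
    return a boolean mask of instances flagged as likely noise."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(y))             # shuffle before partitioning
    subsets = np.array_split(order, n_subsets)  # disjoint "local" subsets

    error_counts = np.zeros(len(y), dtype=int)  # times an instance is contradicted
    vote_counts = np.zeros(len(y), dtype=int)   # times an instance is evaluated

    for idx in subsets:
        local_model = DecisionTreeClassifier(random_state=random_state)
        local_model.fit(X[idx], y[idx])         # local model from one subset
        pred = local_model.predict(X)           # evaluate the *whole* dataset
        error_counts += (pred != y)
        vote_counts += 1

    if scheme == "majority":                    # flagged by more than half of the models
        return error_counts > vote_counts / 2
    elif scheme == "non-objection":             # flagged only if every model disagrees
        return error_counts == vote_counts
    raise ValueError("scheme must be 'majority' or 'non-objection'")

# Toy usage: inject 10% label noise into synthetic data and see what gets flagged.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    noisy = rng.choice(len(y), size=100, replace=False)
    y[noisy] ^= 1                               # flip labels of 10% of the instances
    flagged = flag_class_noise(X, y, n_subsets=5, scheme="majority")
    print("flagged:", flagged.sum(), "true noise among flagged:",
          np.isin(np.where(flagged)[0], noisy).sum())

Under the non-objection scheme an instance is flagged only when every local model contradicts its label, so it is more conservative than the majority scheme and trades recall for precision.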
Additional information
A preliminary version of this paper was published in the Proceedings of the 20th International Conference on Machine Learning, Washington D.C., USA, 2003, pp. 920–927.
Cite this article
ZHU, X., WU, X. & CHEN, Q. Bridging Local and Global Data Cleansing: Identifying Class Noise in Large, Distributed Data Sets. Data Min Knowl Disc 12, 275–308 (2006). https://doi.org/10.1007/s10618-005-0012-8