Abstract
To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and classifiers learned from the major set are used to identify noise in the minor set. The drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it becomes either physically impossible or too time-consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be either technically infeasible or forbidden by policy (for security or privacy reasons) to download data from other sites. These approaches therefore have severe limitations in conducting effective global data cleansing on large, distributed datasets.
In this paper, we propose a solution that bridges local and global analysis for noise cleansing. More specifically, the proposed approach identifies and eliminates mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets, or partition a large dataset into subsets, each of which is regarded as a local subset small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset, and use these good rules to evaluate the whole dataset. For a given instance I_k, two error count variables record the number of times it has been identified as noise by the models from all data subsets; instances with higher error counts have a higher probability of being mislabeled. Two threshold schemes, majority and non-objection, are then used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach.
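The cleansing scheme summarized above is easy to prototype. Below is a minimal sketch in Python, assuming a scikit-learn decision tree as a stand-in for the good-rule learner and collapsing the two per-instance error count variables into a single disagreement counter plus a vote counter; the function name flag_class_noise, the parameter n_subsets, and the toy data are illustrative assumptions rather than the authors' implementation.

# Minimal sketch of partition-based class-noise identification.
# Assumes X, y are NumPy arrays; a decision tree stands in for the
# paper's "good rule" learner, and names here are illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def flag_class_noise(X, y, n_subsets=5, scheme="majority", random_state=0):
    """Partition (X, y) into subsets, learn a local model on each subset,
    count how many local models disagree with each instance's label, and
    return a boolean mask of instances flagged as likely noise."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(y))             # shuffle before partitioning
    subsets = np.array_split(order, n_subsets)  # disjoint "local" subsets

    error_counts = np.zeros(len(y), dtype=int)  # times an instance is contradicted
    vote_counts = np.zeros(len(y), dtype=int)   # times an instance is evaluated

    for idx in subsets:
        local_model = DecisionTreeClassifier(random_state=random_state)
        local_model.fit(X[idx], y[idx])         # local model from one subset
        pred = local_model.predict(X)           # evaluate the *whole* dataset
        error_counts += (pred != y)
        vote_counts += 1

    if scheme == "majority":                    # flagged by more than half of the models
        return error_counts > vote_counts / 2
    elif scheme == "non-objection":             # flagged only if every model disagrees
        return error_counts == vote_counts
    raise ValueError("scheme must be 'majority' or 'non-objection'")

# Toy usage: inject 10% label noise into synthetic data and see what gets flagged.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    noisy = rng.choice(len(y), size=100, replace=False)
    y[noisy] ^= 1                               # flip labels of 10% of the instances
    flagged = flag_class_noise(X, y, n_subsets=5, scheme="majority")
    print("flagged:", flagged.sum(), "true noise among flagged:",
          np.isin(np.where(flagged)[0], noisy).sum())

Under the non-objection scheme an instance is flagged only when every local model contradicts its label, so it is more conservative than the majority scheme and trades recall for precision.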
Additional information
A preliminary version of this paper was published in the Proceedings of the 20th International Conference on Machine Learning, Washington D.C., USA, 2003, pp. 920–927.
Cite this article
ZHU, X., WU, X. & CHEN, Q. Bridging Local and Global Data Cleansing: Identifying Class Noise in Large, Distributed Data Sets. Data Min Knowl Disc 12, 275–308 (2006). https://doi.org/10.1007/s10618-005-0012-8