Data Editing for Semi-Supervised Co-Forest by the Local Cut Edge Weight Statistic Graph
Keywords: semi-supervised learning, data editing, Co-forest, ensemble methods, medical diagnosis.
Many semi-supervised algorithms have been proposed to exploit large amounts of unlabeled training data. However, because labeled examples are scarce, the training data in semi-supervised learning may contain considerable noise: mislabeled examples can snowball through the subsequent learning iterations and hurt the generalization ability of the final hypothesis. If such noise could be identified and removed, the performance of semi-supervised algorithms should improve; yet techniques for identifying and removing noise have seldom been explored in existing semi-supervised algorithms. In this paper, we combine the semi-supervised ensemble method Co-forest with data editing (we call the result CEWS-Co-forest) to improve learning on sparsely labeled medical datasets. The cut edge weight statistic data editing technique actively identifies possibly mislabeled examples in the newly labeled data throughout the co-labeling iterations of Co-forest. Fusing a semi-supervised ensemble method with data editing makes CEWS-Co-forest more robust to the sparsity and the distribution bias of the training data; it also simplifies the design of semi-supervised learning, making CEWS-Co-forest more efficient. An experimental study on several medical data sets shows encouraging results compared with state-of-the-art methods.
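As a hedged illustration of the cut-edge idea (not the paper's exact procedure, which relies on a statistical test over a neighborhood graph), the sketch below flags an example as possibly mislabeled when the weighted fraction of its neighborhood edges that connect to differently labeled examples is high. The k-nearest-neighbor graph, the 1/(1+d) edge weights, the 0.5 threshold, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of example i (Euclidean distance)."""
    dists = sorted(
        (math.dist(points[i][0], points[j][0]), j)
        for j in range(len(points)) if j != i
    )
    return [j for _, j in dists[:k]]

def cut_edge_ratio(points, i, k=3):
    """Weighted fraction of cut edges, i.e. edges from example i to
    neighbors carrying a different label. Values near 1 suggest that
    i disagrees with its local neighborhood."""
    xi, yi = points[i]
    cut, total = 0.0, 0.0
    for j in knn(points, i, k):
        xj, yj = points[j]
        w = 1.0 / (1.0 + math.dist(xi, xj))  # closer neighbors weigh more
        total += w
        if yj != yi:
            cut += w
    return cut / total

def flag_mislabeled(points, k=3, threshold=0.5):
    """Return indices of examples whose cut-edge weight ratio is suspicious."""
    return [i for i in range(len(points))
            if cut_edge_ratio(points, i, k) > threshold]

if __name__ == "__main__":
    # Two well-separated clusters plus one noisy point labeled against
    # its neighborhood: ((x, y) coordinates, class label).
    pts = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
           ((5, 5), 1), ((5, 6), 1), ((6, 5), 1),
           ((0.5, 0.5), 1)]           # sits inside class-0 cluster
    print(flag_mislabeled(pts))       # the noisy point is flagged
```

In CEWS-Co-forest this kind of check would be applied to the newly labeled examples at each co-labeling iteration, so that suspect labels are discarded before they are added to the training set of the other trees.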