COSM: Controlled Over-Sampling Method
A Methodological Proposal to Overcome the Class Imbalance Problem in Data Mining
Keywords: Class imbalance problem, Data Mining, Holdout Method, Oversampling, Rare Class Mining, Undersampling
The class imbalance problem is widespread in Data Mining and can reduce the overall performance of a classification model. Many techniques have been proposed to overcome it, making it possible to train models that can handle rare events. The methodology presented in this paper, called Controlled Over-Sampling Method (COSM), includes a controller model that rejects new synthetic elements whose membership in the minority class is not certain. It combines the holdout method commonly used in Machine Learning with an oversampling algorithm, for example the classic SMOTE algorithm. The proposal explained and designed here serves as a guideline for applying oversampling algorithms, and also as a brief overview of techniques for overcoming the class imbalance problem in Data Mining.
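The idea of a controller that filters synthetic elements can be illustrated with a minimal Python sketch. It is not the procedure specified in the paper. Several choices here are assumptions made only for illustration: the controller is a 3-nearest-neighbour classifier, the oversampler is a simplified SMOTE-style interpolation, and the input is assumed to be the training partition of a holdout split that the caller has already made. The function names `knn_predict`, `interpolate` and `controlled_oversample` and the toy data are hypothetical.

```python
import math
import random


def knn_predict(train, point, k=3):
    """Majority vote among the k labelled points nearest to `point`.

    `train` is a list of (features, label) pairs. This plays the role of
    the controller model (an assumption: any classifier could be used).
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def interpolate(minority, rng, k=3):
    """SMOTE-style step: pick a minority sample, pick one of its k nearest
    minority neighbours, and return a random point on the segment between them."""
    base = rng.choice(minority)
    neighbours = sorted((p for p in minority if p is not base),
                        key=lambda p: math.dist(p, base))[:k]
    other = rng.choice(neighbours)
    gap = rng.random()
    return tuple(b + gap * (o - b) for b, o in zip(base, other))


def controlled_oversample(train, minority_label, n_new, seed=0, max_tries=1000):
    """Generate up to `n_new` synthetic minority samples, keeping only those
    that the controller assigns to the minority class."""
    rng = random.Random(seed)
    minority = [x for x, y in train if y == minority_label]
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_new:
            break
        candidate = interpolate(minority, rng)
        # Controller check: reject candidates not recognised as minority.
        if knn_predict(train, candidate) == minority_label:
            accepted.append(candidate)
    return accepted


# Toy training partition: label 0 is the majority class, label 1 the minority.
# The minority point (2.5, 2.5) sits close to the majority cluster, so
# candidates generated near it are likely to be rejected.
train = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 0),
         ((2.0, 0.0), 0), ((0.0, 2.0), 0), ((2.0, 2.0), 0), ((1.0, 2.0), 0),
         ((2.0, 1.0), 0), ((0.5, 0.5), 0),
         ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1), ((6.0, 6.0), 1),
         ((2.5, 2.5), 1)]

synthetic = controlled_oversample(train, minority_label=1, n_new=5)
balanced_train = train + [(point, 1) for point in synthetic]
```

In this sketch only the training partition is oversampled, so the held-out partition keeps the original class distribution and can still be used for an unbiased evaluation. The `max_tries` cap is a design choice of the sketch: it guarantees termination when the controller rejects most candidates.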