An Overview of Class Imbalance Problem in Supervised Learning

Satuluri Naganjaneyulu*, Mrithyumjaya Rao Kuppa**
* Associate Professor, Lakireddy Bali Reddy College of Engineering, Mylavaram, India.
** Professor, Vaagdevi College of Engineering, Warangal, India.
Periodicity: May - July 2012
DOI: https://doi.org/10.26634/jcs.1.3.1886

Abstract

Data mining and knowledge discovery aim to uncover hidden, valuable knowledge in data sources. Traditional knowledge discovery algorithms are bottlenecked by the wide range of data sources now available. Class imbalance is one of the problems that arises when a data source provides unequal class representation, i.e., examples of one class in a training data set vastly outnumber examples of the other class(es). This paper presents an updated literature survey of current class imbalance learning methods for inducing models that handle imbalanced datasets efficiently.

Keywords

Classification, class imbalance, under-sampling, over-sampling.
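
As a rough illustration of the under-sampling and over-sampling keywords, the following Python sketch measures how imbalanced a label set is and rebalances it by random sampling. It is only a minimal example under stated assumptions, not a method taken from the paper: the function names (imbalance_ratio, random_undersample, random_oversample) and the toy data are hypothetical, and plain random sampling stands in for the more refined techniques a survey of this kind would cover.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_undersample(samples, labels, seed=0):
    """Randomly discard examples until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        balanced.extend((x, y) for x in rng.sample(xs, target))
    return balanced


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced


# Hypothetical toy data: 95 majority examples (class 0), 5 minority (class 1).
toy_samples = list(range(100))
toy_labels = [0] * 95 + [1] * 5

under = random_undersample(toy_samples, toy_labels)
over = random_oversample(toy_samples, toy_labels)
```

Under-sampling shrinks the training set and may discard informative majority examples, while over-sampling keeps all data but repeats minority examples, which can encourage overfitting; which trade-off is acceptable depends on the dataset.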

How to Cite this Article?

Naganjaneyulu, S. and Kuppa, M. (2012). An Overview Of Class Imbalance Problem In Supervised Learning. i-manager’s Journal on Communication Engineering and Systems, 1(3), 1-10. https://doi.org/10.26634/jcs.1.3.1886
