A Survey of Different Techniques to Handle An Unbalanced Dataset

Pooja Yerawar, Ganesh Pakle

Section: Survey Paper, Product Type: Journal Paper
Volume 6, Issue 12, pp. 818-824, Dec 2018

CrossRef-DOI: https://doi.org/10.26438/ijcse/v6i12.818824

Online published on Dec 31, 2018

Copyright © Pooja Yerawar, Ganesh Pakle. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Pooja Yerawar, Ganesh Pakle, “A Survey of Different Techniques to Handle An Unbalanced Dataset,” International Journal of Computer Sciences and Engineering, Vol.6, Issue.12, pp.818-824, 2018.

MLA Style Citation: Pooja Yerawar, Ganesh Pakle. "A Survey of Different Techniques to Handle An Unbalanced Dataset." International Journal of Computer Sciences and Engineering 6.12 (2018): 818-824.

APA Style Citation: Pooja Yerawar, Ganesh Pakle (2018). A Survey of Different Techniques to Handle An Unbalanced Dataset. International Journal of Computer Sciences and Engineering, 6(12), 818-824.

BibTex Style Citation:
@article{Yerawar_2018,
author = {Pooja Yerawar and Ganesh Pakle},
title = {A Survey of Different Techniques to Handle An Unbalanced Dataset},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {12 2018},
volume = {6},
Issue = {12},
month = {12},
year = {2018},
issn = {2347-2693},
pages = {818-824},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=3422},
doi = {10.26438/ijcse/v6i12.818824},
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
DO - 10.26438/ijcse/v6i12.818824
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=3422
TI - A Survey of Different Techniques to Handle An Unbalanced Dataset
T2 - International Journal of Computer Sciences and Engineering
AU - Yerawar, Pooja
AU - Pakle, Ganesh
PY - 2018
DA - 2018/12/31
PB - IJCSE, Indore, INDIA
SP - 818
EP - 824
IS - 12
VL - 6
SN - 2347-2693
ER -

Abstract

Handling unbalanced data is a major challenge for researchers, and the problem arises in many real-world engineering applications. A dataset is unbalanced when at least one class has far fewer examples than another. In such a dataset, examples are divided into the majority class (i.e., negative) and the minority class (i.e., positive). This paper surveys what imbalanced data is, the issues and challenges it raises, and example applications. It also reviews the approaches that have been proposed to rebalance the data and improve classification performance: sampling, feature selection, algorithmic methods, and ensemble techniques such as boosting and bagging.
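
As an illustration of the definitions above, the following is a minimal sketch, not taken from the paper, that measures class imbalance and rebalances a toy dataset by random oversampling, one simple form of the sampling approach. The function names `imbalance_ratio` and `random_oversample` and the toy data are assumptions made for this example only.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen examples of each smaller class until
    every class has as many examples as the majority class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original examples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data (hypothetical): 8 negative (majority) and 2 positive (minority) examples.
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y)
```

Here `imbalance_ratio(y)` is 4.0 before resampling, and `y_bal` contains eight examples of each class afterwards. Plain duplication of minority examples can encourage overfitting, which is one motivation for synthetic methods such as SMOTE.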

Keywords / Index Terms

Imbalanced data, classifiers, sampling, feature selection, ensemble methods, hybrid method

