Imbalanced domains are an important problem in predictive tasks: they degrade performance on the cases that are most relevant to the user. The problem has been studied extensively for classification, where the target variable is nominal. More recently, it has been recognized that imbalanced domains also occur in other contexts and tasks, such as regression, where the target variable is continuous. This paper focuses on imbalanced domains in both classification and regression tasks. Resampling strategies are among the most successful approaches to address imbalanced domains. In this work, we propose variants of existing resampling strategies that take into account information about the neighbourhood of the examples. Instead of sampling uniformly, our proposals bias the strategies to reinforce some regions of the data sets. Through an extensive set of experiments, we provide evidence of the advantage of introducing a neighbourhood bias in resampling strategies for both classification and regression tasks with imbalanced data sets.
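To make the idea of non-uniform sampling concrete, here is a minimal Python sketch of random over-sampling with a neighbourhood bias for a classification task. The abstract does not specify how the neighbourhood information is measured or used, so everything here is an assumption for illustration only: the function names `neighbourhood_weights` and `biased_oversample`, the choice of Euclidean k-nearest neighbours, and the rule that minority examples surrounded by other classes are replicated more often. It should not be read as the method proposed in the paper.

```python
import math
import random


def neighbourhood_weights(X, y, minority, k=3):
    """Weight each minority example by the share of its k nearest
    neighbours that belong to another class (an assumed bias rule)."""
    weights = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # All other examples, ordered by Euclidean distance to xi.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        neighbours = others[:k]
        foreign = sum(1 for j in neighbours if y[j] != minority)
        # The small constant keeps every minority example selectable.
        weights[i] = foreign / max(len(neighbours), 1) + 0.05
    return weights


def biased_oversample(X, y, minority, n_new, k=3, seed=0):
    """Append n_new copies of minority examples, drawn with replacement
    in proportion to their neighbourhood weight instead of uniformly."""
    weights = neighbourhood_weights(X, y, minority, k)
    if not weights:
        raise ValueError("no examples of the minority class were found")
    rng = random.Random(seed)
    candidates = list(weights)
    chosen = rng.choices(
        candidates, weights=[weights[i] for i in candidates], k=n_new
    )
    X_new = list(X) + [X[i] for i in chosen]
    y_new = list(y) + [y[i] for i in chosen]
    return X_new, y_new
```

Setting all weights to the same value recovers plain uniform random over-sampling, so the only change relative to the baseline strategy is the sampling distribution. The opposite bias (favouring minority examples whose neighbours share their class) can be sketched by inverting the weight.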



Volume 35, Issue 4

Fourth special issue on knowledge discovery and business intelligence

August 2018

e12311

Resampling with neighbourhood bias on imbalanced domains
