Volume 38, Issue 4 e12680
ORIGINAL ARTICLE

A novel cost-sensitive algorithm and new evaluation strategies for regression in imbalanced domains

Lamyaa Sadouk

Corresponding Author

Faculty of Science and Technology Settat, University Hassan Ist, Settat, Morocco

Correspondence

Lamyaa Sadouk, Faculty of Science and Technology Settat, University Hassan Ist, Settat, Morocco.

Email: [email protected]

Taoufiq Gadi

Faculty of Science and Technology Settat, University Hassan Ist, Settat, Morocco

El Hassan Essoufi

Faculty of Science and Technology Settat, University Hassan Ist, Settat, Morocco

First published: 28 February 2021
Citations: 8

Abstract

Many real-world data mining applications involve obtaining predictive models from imbalanced datasets. Frequently, the least common values of the target variable are associated with events that are highly relevant to end users. When the target variable is nominal, we have a class-imbalance problem, which has been thoroughly studied within machine learning. For regression tasks, where the target variable is continuous, few predictive models and evaluation techniques exist. This paper proposes a solution to these challenges. First, we introduce a cost-sensitive learning algorithm based on a neural network trained by minimizing a biased loss function. Results show performance and convergence speed higher than or comparable to those of existing techniques. Second, we develop new approaches for assessing the performance of regression tasks within imbalanced domains by proposing new scalar measures, namely Geometric Mean Error (GME) and Class-Weighted Error (CWE), as well as new graphical measures, namely the REC_TPR, REC_TNR, REC_G-Mean and REC_CWA curves. Unlike standard measures, our evaluation strategies are shown to be more robust to data imbalance, as they reflect performance on both rare and frequent events.
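
The abstract does not give the formulas for the biased loss, GME or CWE, so the following is only an illustrative sketch of the general idea, not the authors' definitions. It assumes that rare events are the targets at or above a user-chosen threshold, that the loss is a mean squared error with a larger weight on rare targets, and that the two summary scores are the geometric mean and a weighted average of the per-group mean absolute errors. All function names and parameters (rare_mask, weighted_mse, rare_weight, rare_share and so on) are hypothetical.

```python
import numpy as np


def rare_mask(y_true, threshold):
    """Mark targets treated as rare (assumption: values >= threshold)."""
    return np.asarray(y_true, dtype=float) >= threshold


def weighted_mse(y_true, y_pred, mask, rare_weight=5.0):
    """Squared error with rare targets up-weighted (hypothetical biased loss)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(mask, rare_weight, 1.0)
    return float(np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights))


def group_errors(y_true, y_pred, mask):
    """Mean absolute error on the rare group and on the frequent group."""
    if not mask.any() or mask.all():
        raise ValueError("both the rare and the frequent group must be non-empty")
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(abs_err[mask].mean()), float(abs_err[~mask].mean())


def geometric_mean_of_errors(y_true, y_pred, mask):
    """Geometric mean of the two group errors (stand-in, not the paper's GME)."""
    err_rare, err_freq = group_errors(y_true, y_pred, mask)
    return float(np.sqrt(err_rare * err_freq))


def class_weighted_error_sketch(y_true, y_pred, mask, rare_share=0.5):
    """Weighted average of the two group errors (stand-in, not the paper's CWE)."""
    err_rare, err_freq = group_errors(y_true, y_pred, mask)
    return rare_share * err_rare + (1.0 - rare_share) * err_freq


# Toy data: four frequent targets and one rare, badly predicted target.
y_true = np.array([1.0, 1.0, 1.0, 1.0, 10.0])
y_pred = np.full(5, 2.0)
mask = rare_mask(y_true, threshold=5.0)

print("plain MAE:", float(np.mean(np.abs(y_true - y_pred))))
print("weighted MSE:", weighted_mse(y_true, y_pred, mask))
print("geometric mean of group errors:", geometric_mean_of_errors(y_true, y_pred, mask))
print("weighted group error:", class_weighted_error_sketch(y_true, y_pred, mask))
```

On the toy data, the plain mean absolute error is dominated by the four well-predicted frequent targets, whereas the two group-based scores are pulled up by the single rare target. This is the kind of behaviour the abstract attributes to imbalance-aware measures.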

CONFLICT OF INTEREST

The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available on GitHub [Sadouk, L. (2019). A cost-sensitive learning approach and evaluation strategies for handling regression under imbalanced domains [Source Code], https://github.com/lsadouk/imbalanced_regression].
