Publication Details

AFRICAN RESEARCH NEXUS


Computer Science

A Multi-Schematic Classifier-Independent Oversampling Approach for Imbalanced Datasets

IEEE Access, Volume 9, 2021

Labelled imbalanced data, used in classification problems, have an unequal distribution of samples across the classes. Traditional classification models, such as random forest and gradient boosting, struggle with imbalanced datasets. Over 85 oversampling algorithms, mostly extensions of the SMOTE algorithm, have been developed over the past two decades to address the problem of imbalanced datasets. However, previous studies have shown that different oversampling algorithms have different degrees of efficiency with different classifiers. With numerous algorithms available, it is difficult to decide on an oversampling algorithm for a chosen classifier. Here, we overcome this problem with a multi-schematic and classifier-independent oversampling approach, referred to as ProWRAS (Proximity Weighted Random Affine Shadowsampling). ProWRAS integrates the Localized Random Affine Shadowsampling (LoRAS) algorithm and the Proximity Weighted Synthetic oversampling (ProWSyn) algorithm. By controlling the variance of the synthetic samples, as well as using a proximity-weighted clustering system for the minority class data, the ProWRAS algorithm improves performance compared to algorithms that generate synthetic samples by modelling high-dimensional convex spaces of the minority class. ProWRAS is multi-schematic in that it employs four oversampling schemes, each of which models the variance of the generated data in its own way. The proximity-weighted clustering approach of ProWRAS allows one to generate low-variance synthetic samples only in borderline clusters, to avoid overlap with the majority class. Most importantly, the performance of ProWRAS, with a proper choice of oversampling scheme, is independent of the classifier used. We have benchmarked our newly developed ProWRAS algorithm against five state-of-the-art oversampling models and four different classifiers on 20 publicly available datasets. Our results show that ProWRAS outperforms the other oversampling algorithms in a statistically significant way in terms of both the F1-score and the κ-score. Moreover, we have introduced a novel measure of classifier independence, the J-score, and shown quantitatively that ProWRAS performs better independently of the classifier used. ProWRAS is thus highly effective for homogeneous tabular data, where convex modelling of the data space is possible. In practice, ProWRAS customizes synthetic sample generation according to a classifier of choice and thereby reduces benchmarking efforts.
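
The abstract does not include code, so the following is a minimal sketch of the general convex-combination (shadowsampling-style) idea it describes: synthetic minority samples are drawn as convex combinations of nearby minority points, and the number of points combined controls the variance of the generated data. This is an illustrative assumption, not the authors' ProWRAS implementation; the function name convex_oversample and the parameters n_neighbors and n_affine are hypothetical.

```python
# Illustrative sketch of SMOTE-style convex-combination oversampling,
# NOT the ProWRAS algorithm itself. Increasing n_affine averages over more
# minority points, lowering the variance of the synthetic samples.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def convex_oversample(X_min, n_synthetic, n_neighbors=5, n_affine=3, seed=0):
    """Generate synthetic minority samples as convex combinations of
    nearby minority points (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_min)
    _, idx = nn.kneighbors(X_min)

    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        # pick a random minority point and a few of its neighbours
        anchor = rng.integers(len(X_min))
        chosen = rng.choice(idx[anchor], size=n_affine, replace=False)
        # convex weights: non-negative and summing to one
        w = rng.dirichlet(np.ones(n_affine))
        synthetic[i] = w @ X_min[chosen]
    return synthetic

# Example: upsample a toy imbalanced dataset, then train any classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_min = X[y == 1]
X_syn = convex_oversample(X_min, n_synthetic=len(X[y == 0]) - len(X_min))
X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(len(X_syn), dtype=int)])
RandomForestClassifier().fit(X_bal, y_bal)
```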
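The two benchmark metrics named in the abstract, the F1-score and the κ-score, are standard and can be computed with scikit-learn as sketched below on a toy imbalanced dataset. The J-score for classifier independence is defined in the paper itself and is not reproduced here.

```python
# Minimal sketch of the two benchmark metrics named in the abstract,
# computed with scikit-learn on a toy imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier().fit(X_tr, y_tr)  # any classifier of choice
y_pred = clf.predict(X_te)
print("F1-score:", f1_score(y_te, y_pred))
print("kappa-score:", cohen_kappa_score(y_te, y_pred))
```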
Statistics
Citations: 7
Authors: 3
Affiliations: 2