New developments for addressing class imbalance issue in classification tasks
Abstract
Imbalanced datasets pose a significant challenge in machine learning, often resulting in biased models that learn inadequately from minority class instances. The Synthetic Minority Over-sampling Technique (SMOTE) is a well-established method that addresses this issue by generating synthetic samples for underrepresented classes. This Ph.D. thesis introduces several novel extensions to SMOTE, namely Distance ExtSMOTE, Dirichlet ExtSMOTE, FCRP SMOTE, BGMM SMOTE, and Deep ExtSMOTE, each designed to enhance performance and manage complexities in various applications. Class imbalance, compounded by abnormal instances and the curse of dimensionality, adversely impacts model accuracy and reliability. Many existing techniques, including SMOTE, do not handle these complexities effectively. Our research develops new methodologies to tackle these challenges. Distance ExtSMOTE, Dirichlet ExtSMOTE, FCRP SMOTE, and BGMM SMOTE handle abnormal instances and improve the quality of synthetic samples by leveraging weighted averages of neighbouring data points. These methods effectively manage outliers and noisy data, enhancing the robustness of the classification models. Additionally, Deep ExtSMOTE integrates these methods with autoencoders to further enhance accuracy when dealing with high-dimensional data. Empirical studies validate the effectiveness of these extensions, demonstrating significant improvements in classification performance across a range of metrics, including F1 score, PR-AUC, and MCC. These advancements are particularly relevant in real-world applications such as medical diagnosis, fraud detection, churn prediction, and fault detection. Addressing class imbalance in the presence of abnormal instances through these methods leads to more accurate and reliable predictions, ultimately contributing to better decision-making and improved outcomes in these critical areas.
This work offers valuable tools for researchers and practitioners by advancing methodologies for handling class imbalance, abnormal instances, and high-dimensional data. The proposed techniques provide enhanced capabilities for managing complex classification tasks and achieving impactful results across diverse application domains.
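To make concrete the idea of building synthetic minority samples from neighbouring data points, the following is a minimal sketch and not the thesis's implementation. The first function follows the standard SMOTE interpolation between a minority point and one of its nearest minority neighbours. The second is a hypothetical illustration of a "weighted average of neighbours": it assumes the weights are drawn from a symmetric Dirichlet distribution, which is only one plausible reading of the abstract. All function names and parameters (`k_nearest`, `smote_sample`, `dirichlet_weighted_sample`, `k`, `alpha`) are assumptions introduced here for illustration.

```python
import math
import random


def k_nearest(point, pool, k):
    """Return the k points in `pool` closest to `point` (Euclidean distance),
    excluding the `point` object itself."""
    others = [p for p in pool if p is not point]
    return sorted(others, key=lambda p: math.dist(point, p))[:k]


def smote_sample(minority, k=3, rng=random):
    """Standard SMOTE step: pick a minority point, pick one of its k nearest
    minority neighbours, and interpolate at a uniformly random position."""
    base = rng.choice(minority)
    neighbour = rng.choice(k_nearest(base, minority, k))
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


def dirichlet_weighted_sample(minority, k=3, alpha=1.0, rng=random):
    """Hypothetical variant (an assumption, not the thesis's algorithm):
    a convex combination of a minority point and its k nearest neighbours,
    with weights drawn from a symmetric Dirichlet(alpha) distribution."""
    base = rng.choice(minority)
    group = [base] + k_nearest(base, minority, k)
    # Normalised independent Gamma(alpha, 1) draws follow a Dirichlet law.
    draws = [rng.gammavariate(alpha, 1.0) for _ in group]
    total = sum(draws)
    weights = [d / total for d in draws]
    return [sum(w * p[j] for w, p in zip(weights, group))
            for j in range(len(base))]
```

Because the weights in the second function are non-negative and sum to one, each synthetic point stays inside the convex hull of the chosen neighbourhood; averaging over several neighbours also limits how far a single outlying neighbour can pull the sample.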