Addressing Imbalance in Health Datasets: A New Method NR-Clustering SMOTE and Distance Metric Modification  


Authors: Hairani Hairani, Triyanna Widiyaningtyas, Didik Dwi Prasetya, Afrig Aminuddin

Affiliations: [1] Department of Electrical Engineering and Informatics, Faculty of Engineering, Universitas Negeri Malang, Malang, 65145, Indonesia; [2] Department of Computer Science, Universitas Bumigora, Mataram, 83127, Indonesia; [3] Department of Computer Graphic and Multimedia, Faculty of Computing, College of Computing and Applied Sciences, Universiti Malaysia Pahang Al-Sultan Abdullah, Pekan, 26600, Malaysia

Source: Computers, Materials & Continua, 2025, Issue 2, pp. 2931-2949 (19 pages)

Funding: Funded by Universitas Negeri Malang, contract number 4.4.841/UN32.14.1/LT/2024.

Abstract: Imbalanced datasets often pose challenges for machine learning, particularly for classification methods. Underrepresented minority classes can result in biased and inaccurate models. The Synthetic Minority Over-Sampling Technique (SMOTE) was developed to address the problem of imbalanced data. Over time, several weaknesses of SMOTE have been identified in the synthetic minority class data it generates, such as overlapping, noise, and small disjuncts. However, existing studies generally address only one of these weaknesses, either noise or overlapping. This study therefore addresses both issues simultaneously, proposing a combined approach of filtering, clustering, and distance modification to reduce the noise and overlapping produced by SMOTE. Filtering, performed with the k-NN method, removes minority class data (noise) located in majority class regions. Applying this Noise Reduction (NR) step, which removes data considered to be noise before SMOTE is applied, has a positive impact on overcoming data imbalance. Clustering establishes decision boundaries by partitioning the data into clusters, allowing SMOTE with a modified distance metric to generate minority class data within each cluster. This combination of clustering and distance modification aims to minimize overlap in the synthetic minority data, which could otherwise introduce noise. The proposed method, called "NR-Clustering SMOTE," balances the data in three stages: (1) filtering, which removes minority class data close to the majority class (noise) using the k-NN method; (2) clustering with K-means, which establishes decision boundaries by partitioning the data into several clusters; and (3) applying SMOTE oversampling with Manhattan distance within each cluster. Test results indicate that the proposed NR-Clustering SMOTE method achieves the best performance across all evaluation metrics for classification methods such as Random Forest, SVM, and Naïve Bayes, compared with the original data and traditional SMOTE.
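The three stages named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the abstract does not give the exact noise rule, the number of clusters, or whether clustering is applied to all data or only to the minority class. The sketch assumes a binary problem, treats a minority point as noise when all of its k nearest neighbours belong to the majority class, clusters only the remaining minority points, and splits the number of synthetic points evenly across clusters. All function names and parameters (`filter_noise`, `nr_clustering_smote`, `n_clusters`, `k`, `seed`) are hypothetical.

```python
import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_indices(point, data, k, skip=None):
    """Indices of up to k rows of `data` nearest to `point` (Manhattan distance)."""
    order = sorted((i for i in range(len(data)) if i != skip),
                   key=lambda i: manhattan(point, data[i]))
    return order[:k]

def filter_noise(X, y, minority, k=3):
    """Stage 1 (assumed rule): drop minority points whose k neighbours are all majority."""
    keep = []
    for i, (p, label) in enumerate(zip(X, y)):
        if label == minority:
            nbrs = knn_indices(p, X, k, skip=i)
            if all(y[j] != minority for j in nbrs):
                continue  # treated as noise inside the majority region
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def kmeans(points, n_clusters, rng, iters=20):
    """Stage 2: plain K-means; returns one cluster label per point."""
    centers = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(n_clusters),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(n_clusters):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def smote_in_cluster(members, n_new, k, rng):
    """Stage 3: SMOTE interpolation between Manhattan-nearest neighbours in one cluster."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(members))
        j = rng.choice(knn_indices(members[i], members, k, skip=i))
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(members[i], members[j])])
    return synthetic

def nr_clustering_smote(X, y, minority, n_clusters=2, k=3, seed=0):
    rng = random.Random(seed)
    X, y = filter_noise(X, y, minority, k)
    minority_pts = [p for p, l in zip(X, y) if l == minority]
    deficit = (len(y) - len(minority_pts)) - len(minority_pts)
    if deficit <= 0 or len(minority_pts) < 2:
        return X, y
    n_clusters = min(n_clusters, len(minority_pts))
    labels = kmeans(minority_pts, n_clusters, rng)
    clusters = [[p for p, l in zip(minority_pts, labels) if l == c]
                for c in range(n_clusters)]
    clusters = [c for c in clusters if len(c) >= 2]  # singletons cannot interpolate
    X_out, y_out = list(X), list(y)
    for idx, members in enumerate(clusters):
        # split the deficit evenly; the first clusters absorb the remainder
        n_new = deficit // len(clusters) + (1 if idx < deficit % len(clusters) else 0)
        new_pts = smote_in_cluster(members, n_new, k, rng)
        X_out.extend(new_pts)
        y_out.extend([minority] * len(new_pts))
    return X_out, y_out
```

Because each synthetic point is interpolated between two minority points of the same cluster, it cannot land in the gap between clusters, which is the overlap-reduction idea the abstract describes. Calling `nr_clustering_smote(X, y, minority=1)` on lists of feature vectors and labels returns the filtered data with the synthetic minority points appended.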

Keywords: SMOTE modification; Clustering-SMOTE; Manhattan distance

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
