Imbalanced data classification method based on Lasso and constructive covering algorithm  Cited by: 2


Authors: JIANG Yi (蒋溢)[1]; WU Shuping (伍书平); HU Kun (胡昆); LONG Linbo (龙林波)

Affiliations: [1] College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; [2] Cloud Computing Center of Yunnan Branch, China Telecom Corporation Limited, Kunming, Yunnan 650200, China

Source: Journal of Computer Applications (《计算机应用》), 2023, No. 4, pp. 1086-1093 (8 pages)

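Funding: National Natural Science Foundation of China (61902045); Chongqing Technology Innovation and Application Development Special Key Project (cstc2019jscx‑mbdxX0035).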

Abstract: To address the insufficient ability of machine learning classification algorithms to identify minority-class samples in imbalanced data classification, an imbalanced data classification method, L-CCSmote (Least absolute shrinkage and selection operator Constructive Covering Synthetic minority oversampling technique), was proposed, taking the telecom customer churn scenario as an example. Firstly, features related to churned customers were extracted through Lasso (Least absolute shrinkage and selection operator) to optimize the model input. Then, a neural network was built through the Constructive Covering Algorithm (CCA) to generate covers that conform to the overall distribution of the samples. Finally, a single-sample coverage strategy, a sample diversity strategy and a sample density peak strategy were proposed, and hybrid sampling based on these strategies was performed to balance the data. Thirteen imbalanced datasets from the KEEL database and two desensitized telecom customer datasets were selected, and the proposed method was verified with the Logistic Regression (LR) and Support Vector Machine (SVM) classification algorithms. With LR, the average Geometric MEAN (G-MEAN) of the proposed method was 2.32% higher than that of SMOTE-Enn (Synthetic Minority Oversampling TEchnique Edited nearest neighbor). With SVM, the average G-MEAN of the proposed method was 2.44% higher than that of Borderline-SMOTE (Borderline Synthetic Minority Oversampling Technique). Experimental results show that the proposed method can address the effect of skewed class distribution on classification, and that its ability to recognize rare classes is better than that of classical data balancing methods.
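
The abstract reports results as G-MEAN but does not define it. As a minimal sketch, assuming the common binary definition (the square root of minority-class recall times majority-class recall), the metric can be computed as follows. The function name `g_mean` and the toy labels are hypothetical and are not taken from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.

    Assumption: `positive` marks the minority class (e.g. churned customers).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the minority class
    specificity = tn / (tn + fp)  # recall on the majority class
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: 2 minority samples (1) and 8 majority samples (0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_majority = [0] * 10                      # always predicts the majority class
one_hit = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]     # finds one minority sample, one false alarm

print(g_mean(y_true, all_majority))
print(g_mean(y_true, one_hit))
```

In this toy case the always-majority predictor is 80% accurate yet has a G-MEAN of zero, because it never identifies a minority sample. This is why the metric suits imbalanced problems better than plain accuracy.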

Keywords: Lasso; constructive covering algorithm; imbalanced data classification; customer churn prediction; hybrid sampling

Classification code: TP301.6 [Automation and Computer Technology: Computer Systems Architecture]

 
