Authors: JIANG Yi [1]; WU Shuping; HU Kun; LONG Linbo
Affiliations: [1] College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; [2] Cloud Computing Center of Yunnan Branch, China Telecom Corporation Limited, Kunming, Yunnan 650200, China
Source: Journal of Computer Applications (《计算机应用》), 2023, No. 4, pp. 1086-1093 (8 pages)
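Funding: National Natural Science Foundation of China (61902045); Chongqing Technology Innovation and Application Development Special Key Project (cstc2019jscx‑mbdxX0035).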
Abstract: Machine learning classification algorithms often identify minority-class samples poorly when the data are imbalanced. To address this problem, an imbalanced data classification method, L-CCSmote (Least absolute shrinkage and selection operator Constructive Covering Synthetic minority oversampling technique), was proposed, taking the telecom customer churn scenario as an example. Firstly, features related to churned customers were extracted through Lasso (Least absolute shrinkage and selection operator) to optimize the model input. Then, a neural network was built through the Constructive Covering Algorithm (CCA) to generate covers that conformed to the overall distribution of the samples. Finally, a single-sample cover strategy, a sample diversity strategy and a sample density peak strategy were further proposed, and these strategies were combined in a hybrid sampling step to balance the data. Thirteen imbalanced datasets from the KEEL database and two desensitized telecom customer datasets were selected, and the proposed method was verified on the Logistic Regression (LR) and Support Vector Machine (SVM) classification algorithms respectively. On the LR classification algorithm, compared with SMOTE-Enn (Synthetic Minority Oversampling TEchnique Edited nearest neighbor), the proposed method increased the average Geometric MEAN (G-MEAN) by 2.32%. On the SVM classification algorithm, compared with Borderline-SMOTE (Borderline Synthetic Minority Oversampling Technique), the proposed method increased the average G-MEAN by 2.44%. Experimental results show that the proposed method can address the effect of skewed class distribution on classification, and that its ability to recognize rare classes is better than that of the classical data balancing methods.
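The abstract reports results as G-MEAN but does not state its formula. The sketch below assumes the definition commonly used for binary imbalanced classification, the square root of sensitivity (minority-class recall) times specificity (majority-class recall). The function name `g_mean` and the toy labels are illustrative and do not come from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.

    Assumed definition (not taken from the paper): sqrt(TPR * TNR).
    """
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("y_true must contain both classes")
    sensitivity = tp / (tp + fn)  # recall on the minority (positive) class
    specificity = tn / (tn + fp)  # recall on the majority (negative) class
    return math.sqrt(sensitivity * specificity)


# Toy data: 2 churners (label 1) among 10 customers.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10                    # 80% accurate, finds no churner
one_hit = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]      # finds 1 churner, 1 false alarm

print(g_mean(y_true, always_majority))
print(g_mean(y_true, one_hit))
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy data, yet its G-MEAN is zero because its minority-class recall is zero. This is why a metric of this kind is usually preferred over accuracy when classes are skewed.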
Keywords: Lasso; Constructive Covering Algorithm; imbalanced data classification; customer churn prediction; hybrid sampling
Classification code: TP301.6 [Automation and Computer Technology: Computer System Architecture]