A Generalization- and Difficulty-Aware Oversampling Technique for Software Defect Prediction


Authors: FAN Hongqi; YAN Yuanting; ZHANG Yiwen[1,2]; ZHANG Yanping[1,2]

Affiliations: [1] Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei 230601, Anhui, China; [2] School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, China

Source: Computer Integrated Manufacturing Systems, 2024, No. 8, pp. 2663-2671 (9 pages)

Funding: National Natural Science Foundation of China (61806002, 62272001).

Abstract: The imbalanced class distribution of software defect data poses a major challenge for software defect prediction. Synthetic oversampling is the most widely used technique for addressing this problem, but designing a sampling strategy that avoids the over-generalization risk caused by introducing abnormal samples remains an open difficulty for oversampling methods in software defect prediction. To address this, the paper proposes a Generalization and Difficulty-aware Oversampling (GDOS) method that combines sample learning difficulty with the generalization impact of synthesis. For each oversampling seed sample, GDOS measures a safe factor and a generalization factor from the local prior probability of the samples and the sample distribution along each potential synthesis direction, and uses these to weight the selection of assistant minority samples. By suppressing the probability of synthesizing samples in potentially over-generalized regions and giving relatively safe neighbor synthesis directions a higher selection probability, GDOS supports the synthesis of high-quality samples. Experiments on 26 datasets from the PROMISE repository, comparing against nine state-of-the-art methods that include classical sampling methods and sampling methods designed specifically for software defect prediction, show that GDOS achieves better performance in terms of MCC, pd, pf, and F-measure.
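The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the GDOS algorithm itself: the abstract does not give the formulas, so everything below is an assumption made for illustration. The safe factor is taken as the fraction of minority samples among an assistant's k nearest neighbors (a stand-in for the local prior probability). The generalization factor is taken as 1 / (1 + number of majority samples inside the ball whose diameter is the seed-assistant segment). The function names `local_minority_prior`, `direction_crowding`, and `weighted_oversample` are hypothetical.

```python
import numpy as np


def local_minority_prior(X, y, idx, k, minority=1):
    """Fraction of minority samples among the k nearest neighbours of X[idx].

    Used here as an assumed stand-in for the 'local prior probability'.
    """
    d = np.linalg.norm(X - X[idx], axis=1)
    d[idx] = np.inf  # exclude the sample itself
    k = min(k, len(X) - 1)
    nn = np.argsort(d)[:k]
    return float(np.mean(y[nn] == minority))


def direction_crowding(X_maj, a, b):
    """Count majority samples inside the ball whose diameter is the segment a-b.

    An assumed proxy for 'sample distribution along the synthesis direction'.
    """
    centre = (a + b) / 2.0
    radius = np.linalg.norm(a - b) / 2.0
    return int(np.sum(np.linalg.norm(X_maj - centre, axis=1) <= radius))


def weighted_oversample(X, y, n_new, k=5, minority=1, seed=0):
    """Generate n_new synthetic minority samples by weighted interpolation."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    X_maj = X[y != minority]
    synthetic = []
    for _ in range(n_new):
        s = rng.choice(min_idx)  # seed sample
        # Candidate assistants: the seed's nearest minority neighbours.
        others = min_idx[min_idx != s]
        d = np.linalg.norm(X[others] - X[s], axis=1)
        cand = others[np.argsort(d)[:k]]
        # Assumed safe factor: local minority proportion around each assistant.
        safe = np.array([local_minority_prior(X, y, c, k, minority) for c in cand])
        # Assumed generalization factor: fewer majority samples between
        # seed and assistant gives a higher value.
        gen = np.array(
            [1.0 / (1.0 + direction_crowding(X_maj, X[s], X[c])) for c in cand]
        )
        w = safe * gen + 1e-12  # small constant keeps the weights valid if all are zero
        w /= w.sum()
        a = rng.choice(cand, p=w)  # pick an assistant by weight
        gap = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append(X[s] + gap * (X[a] - X[s]))
    return np.asarray(synthetic)
```

Calling `weighted_oversample(X, y, n_new=10, k=3)` on a feature matrix `X` and a 0/1 label vector `y` returns a `(10, n_features)` array of interpolated minority points. Multiplying the two factors is only one possible way to combine them, chosen so that a direction is down-weighted whenever either factor is low.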

Keywords: software defect prediction; class imbalance; oversampling; over-generalization

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
