Aggregation Model for Software Defect Prediction Based on Data Enhancement by GAN  Cited by: 3


Authors: XU Jinpeng (徐金鹏); GUO Xinfeng (郭新峰) [1]; WANG Ruibo (王瑞波); LI Jihong (李济洪)

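Affiliations: [1] School of Automation and Software Engineering, Shanxi University, Taiyuan 030006, China; [2] School of Modern Education Technology, Shanxi University, Taiyuan 030006, China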

Source: Computer Science (《计算机科学》), 2023, No. 12, pp. 24-31 (8 pages)

Funding: Young Scientists Fund of the National Natural Science Foundation of China (61806115).

Abstract: In software defect prediction, a software defect prediction (SDP) model is usually built by applying machine learning classification algorithms to datasets of static software features such as the C&K metrics. However, most such datasets contain few defects, and the resulting severe class imbalance degrades the prediction performance of the learned SDP model. This paper uses a generative adversarial network (GAN) to generate positive samples, screens the generated samples with the FID score, and adds the selected samples to enlarge the positive class. Then, under the block-regularized m×2 cross-validation (m×2BCV) framework, the results of multiple sub-models are aggregated by majority voting to form the final SDP model. Twenty datasets from the PROMISE database serve as experimental data, and the random forest algorithm is used to build the SDP aggregation model. Experimental results show that, compared with traditional random over-sampling, SMOTE, and random under-sampling, the proposed SDP aggregation model increases the average F1 value by 10.2%, 5.7%, and 3.4%, respectively, and the stability of F1 improves accordingly. The proposed SDP aggregation model achieves the highest F1 value on 17 of the 20 datasets. In terms of AUC, there is no obvious difference between the proposed method and the traditional sampling methods.
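The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions: the partitions are drawn by plain random halving (the overlap constraint that makes m×2BCV "block-regularized" is not implemented), a hypothetical one-feature threshold rule `train_threshold` stands in for the random forest, label 1 is taken to mean "defective", ties among the 2m votes are broken toward label 1, and the GAN and FID screening stage is omitted. All function names are illustrative.

```python
import random
from collections import Counter


def m_by_2_partitions(n, m, seed=0):
    """Draw m random half/half partitions of the indices 0..n-1.

    Assumption: plain random halving. The block-regularized variant
    (m×2BCV) also constrains the overlap between partitions, which is
    not implemented here.
    """
    rng = random.Random(seed)
    partitions = []
    for _ in range(m):
        idx = list(range(n))
        rng.shuffle(idx)
        partitions.append((idx[: n // 2], idx[n // 2:]))
    return partitions


def train_threshold(xs, ys):
    """Toy stand-in for a real classifier such as a random forest.

    Learns a cut-off halfway between the class means of one feature and
    predicts 1 above it (assumes label-1 samples have larger values).
    """
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:  # only one class present: predict that class
        label = 1 if len(pos) > len(neg) else 0
        return lambda x: label
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


def majority_vote(votes, tie_label=1):
    """Return the most frequent label; ties go to tie_label (an assumption)."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


def fit_aggregate(xs, ys, m=3, seed=0):
    """Train 2*m sub-models, one per half of each partition, and vote."""
    models = []
    for first, second in m_by_2_partitions(len(xs), m, seed):
        for half in (first, second):
            models.append(
                train_threshold([xs[i] for i in half], [ys[i] for i in half])
            )
    return lambda x: majority_vote([model(x) for model in models])


def f1_score(y_true, y_pred):
    """F1 of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Made-up toy data: one feature, five clean modules and three defective ones.
xs = [1, 2, 3, 4, 5, 10, 11, 12]
ys = [0, 0, 0, 0, 0, 1, 1, 1]
predict = fit_aggregate(xs, ys, m=3, seed=0)
print([predict(0), predict(20)])  # [0, 1]
```

Because 2m is even, the vote can tie, so some tie-breaking rule is needed; the abstract does not say which one the paper uses. In a fuller pipeline, the training halves would be augmented with the screened GAN samples before each sub-model is fitted.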

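Keywords: generative adversarial network; data augmentation; block-regularized cross-validation; software defect prediction; aggregation model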

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
