Detection of Similar Duplicate Records Based on OCSVM and Multi-objective Ant Colony Optimization

Cited by: 12


Authors: Lü Guojun; CAO Jianjun; ZHENG Qibin; CHANG Chen; WENG Nianfeng; PENG Cong

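Affiliations: [1] Institute of Command and Control Engineering, Army Engineering University, Nanjing 210007, Jiangsu, China; [2] The 63rd Research Institute, National University of Defense Technology, Nanjing 210007, Jiangsu, China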

Source: Acta Armamentarii (《兵工学报》), 2020, No. 2, pp. 324-331 (8 pages)

Funding: National Natural Science Foundation of China, General Program (61371196); China Postdoctoral Science Foundation (2015M582832)

Abstract: To address the scarcity of similar duplicate record samples in data sources, a classification-based detection method for similar duplicate records is proposed that uses a one-class support vector machine (OCSVM) with multi-objective ant colony optimization. According to whether the two records in a record pair are similar, detection is modeled as a two-class classification problem; classification is performed by an OCSVM trained only on dissimilar record pairs. Appropriate attribute similarity functions are selected to compute the similarity feature vector of each record pair, which serves as the input of the OCSVM classifier. A multi-objective feature selection model is built whose objectives are the combined optimization of precision, recall, and the number of features. Because the training samples belong to a single class, the heuristic factor is defined as a constraint that minimizes intra-class divergence, and a multi-objective ant colony algorithm is designed to solve the model. Comparisons with the support vector domain description algorithm and the traditional two-class support vector machine verify the effectiveness and superiority of the OCSVM approach.

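Keywords: data cleaning; similar duplicate record detection; multi-objective ant colony algorithm; feature selection; one-class support vector machine; support vector domain description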

Classification code: TP311.11 [Automation and Computer Technology: Computer Software and Theory]
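
Two ideas from the abstract can be illustrated with a minimal sketch: turning a record pair into a per-attribute similarity feature vector, and comparing candidate feature subsets on the three objectives (precision, recall, number of features) by Pareto dominance. Everything named here is an assumption for illustration: the abstract does not state which attribute similarity functions were used, so `difflib.SequenceMatcher` is only a stand-in, and the function names `similarity_vector`, `dominates`, and `pareto_front` are hypothetical. OCSVM training and the ant colony search itself are not shown.

```python
from difflib import SequenceMatcher


def similarity_vector(rec_a, rec_b, attributes):
    """Return one similarity score in [0, 1] per attribute for a record pair.

    SequenceMatcher is a stand-in; the paper's similarity functions are not
    given in the abstract.
    """
    return [
        SequenceMatcher(None, str(rec_a[k]), str(rec_b[k])).ratio()
        for k in attributes
    ]


def dominates(p, q):
    """True if candidate p dominates q.

    Each candidate is (precision, recall, n_features): higher precision and
    recall are better, fewer features are better.
    """
    no_worse = p[0] >= q[0] and p[1] >= q[1] and p[2] <= q[2]
    strictly_better = p[0] > q[0] or p[1] > q[1] or p[2] < q[2]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [
        p for p in candidates
        if not any(dominates(q, p) for q in candidates)
    ]


# Hypothetical record pair and hypothetical objective values.
rec_a = {"name": "Zhang Wei", "city": "Nanjing"}
rec_b = {"name": "Zhang Wei", "city": "Beijing"}
features = similarity_vector(rec_a, rec_b, ["name", "city"])

scores = [(0.9, 0.8, 5), (0.9, 0.8, 7), (0.7, 0.95, 4)]
front = pareto_front(scores)
```

In this sketch the second candidate is dropped because the first matches its precision and recall with fewer features, while the third survives because it trades precision for recall and a smaller feature set.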

 
