Approximately Duplicate Record Detection Model for Security Data Based on CNN


Authors: Wang Wei [1,2,3], Liu Yang, Hong Huijun [1,2], Liang Yajing

Affiliations: [1] School of Information & Electrical Engineering, Hebei University of Engineering, Handan 056038, Hebei, China; [2] Hebei Key Laboratory of Security & Protection Information Sensing and Processing, Handan 056038, Hebei, China; [3] School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, Jiangsu, China

Source: Computer Applications and Software (《计算机应用与软件》), 2023, Issue 2, pp. 17-25 (9 pages)

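Funding: National Natural Science Foundation of China (61802107); Ministry of Education-China Mobile Research Fund (MCM20170204); Jiangsu Postdoctoral Research Funding Program (1601085C).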

Abstract: Structured data in the security industry contains a large number of approximately duplicate records, and the recognition rate of traditional approximately duplicate record detection algorithms struggles to meet the industry's practical requirements. To address this, a convolutional neural network is introduced and two improved models based on LeNet-5 are designed: one takes a word embedding matrix as input, the other a similarity matrix. Experiments show that the precision and recall of the model with word embedding matrix input both exceed 96%, while the precision and recall of the model with similarity matrix input reach as high as 98%. K-fold cross-validation results indicate that both models have strong generalization ability.
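The abstract does not say how the similarity matrix is built or which string metric is used. As a rough illustration only, the sketch below assumes that entry (i, j) compares field i of one record with field j of the other, and uses Python's `difflib.SequenceMatcher` as a stand-in metric. The function names and sample records are hypothetical and are not taken from the paper. The precision and recall helper follows the standard definitions for the positive (duplicate) class.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; difflib is an assumed stand-in metric."""
    return SequenceMatcher(None, a, b).ratio()


def similarity_matrix(rec_a: list[str], rec_b: list[str]) -> list[list[float]]:
    """Entry [i][j] compares field i of rec_a with field j of rec_b (assumed layout)."""
    return [[field_similarity(x, y) for y in rec_b] for x in rec_a]


def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Precision and recall for the positive class (1 = duplicate pair)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical records with made-up fields (not from the paper's dataset).
record_a = ["Zhang Wei", "Handan", "13800000000"]
record_b = ["Zhang  Wei", "Handan City", "13800000000"]
matrix = similarity_matrix(record_a, record_b)
```

A matrix of this kind could then be fed to a LeNet-5-style classifier as a single-channel image. The network itself is omitted here because the abstract gives no layer details.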

Keywords: security industry; data cleaning; approximately duplicate record detection; CNN; LeNet-5

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
