Optimized Algorithm Design for Text Duplicate Checking with Natural Language Processing (cited by: 11)

Algorithm Design of Text Duplicated-checking Based on Natural Language Processing

Authors: DONG Xing-tong; CHEN Shi-hong [1]; CHEN Shu-xin (School of Chemical and Materials Engineering, Beijing Technology and Business University, Beijing 100048, China; Department of Communication and Electronic Engineering, Qiqihar University, Qiqihar 161006, China; Department of Computer Science and Technology, Tianjin Ren'ai College, Tianjin 301636, China)

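Affiliations: [1] School of Chemical and Materials Engineering, Beijing Technology and Business University, Beijing 100048; [2] School of Communication and Electronic Engineering, Qiqihar University, Qiqihar 161006; [3] Department of Computer Science and Technology, Tianjin Ren'ai College, Tianjin 301636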

Source: Science Technology and Engineering (《科学技术与工程》), 2022, No. 3, pp. 1091-1097 (7 pages)

Funding: National Natural Science Foundation of China (U2031142); National Natural Science Foundation of China, Young Scientists Fund (11803013).

Abstract: To address duplication in the practice reports that college students submit during internships, classified text data were collected from a university teaching management department. The texts were segmented with the Jieba word segmentation tool and converted to word vectors with Word2vec to support accurate semantic analysis of natural language. Taking topic word extraction, probability distribution, and time complexity into account, the Python os library was used to batch-process the files, removing duplicates, stop words, and non-Chinese words. Importance sampling was applied to optimize the latent Dirichlet allocation (LDA) model, yielding a new training model, importance sampling latent Dirichlet allocation (ISLDA), which extracts topic words; cosine similarity was then used to compute the repetition rate. A comparison of the two models in terms of topic word categories and the probability distribution of each word shows that the new training model improves on the topic model, raising both the training accuracy of the model and its ability to detect duplication in test texts.
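The cleaning and similarity steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the text has already been segmented (the paper uses Jieba), it uses plain term counts in place of Word2vec vectors or ISLDA topic words, and the stop-word set and the function names `clean_tokens` and `cosine_similarity` are hypothetical, chosen only for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word set; a real run would load a full Chinese list from disk.
STOP_WORDS = {"的", "了", "和", "在"}

# Matches strings made up only of CJK Unified Ideographs.
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def clean_tokens(tokens):
    """Drop stop words and any token that contains non-Chinese characters."""
    return [t for t in tokens if CHINESE_ONLY.fullmatch(t) and t not in STOP_WORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as sharing nothing
    return dot / (norm_a * norm_b)


# Tokens as a segmenter might return them (hand-written here for illustration).
report_a = clean_tokens(["实习", "的", "报告", "内容", "2022", "实习"])
report_b = clean_tokens(["实习", "报告", "了", "总结", "abc"])
print(round(cosine_similarity(report_a, report_b), 3))  # prints 0.707
```

A score near 1 indicates that two reports use nearly the same vocabulary in the same proportions, while a score near 0 indicates little overlap. Which threshold counts as duplication is not stated in the abstract and would have to be chosen separately.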

Keywords: semantic analysis; duplicate-checking model; importance sampling; text vectorization; similarity calculation

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
