Authors: DONG Xing-tong; CHEN Shi-hong[1]; CHEN Shu-xin
Affiliations: [1] School of Chemical and Materials Engineering, Beijing Technology and Business University, Beijing 100048, China; [2] Department of Communication and Electronic Engineering, Qiqihar University, Qiqihar 161006, China; [3] Department of Computer Science and Technology, Tianjin Ren'ai College, Tianjin 301636, China
Source: Science Technology and Engineering (《科学技术与工程》), 2022, No. 3, pp. 1091-1097 (7 pages)
Funding: National Natural Science Foundation of China (U2031142); National Natural Science Foundation of China, Young Scientists Fund (11803013).
Abstract: To investigate duplication in the practice reports that college students submit during internships, classified data for the relevant texts were collected from a university teaching management department. The texts were segmented with the Jieba word segmentation tool and converted to word vectors with Word2vec for semantic analysis. With topic word extraction, probability distribution, and time complexity in mind, the Python os library was used for batch processing to remove duplicates, stop words, and non-Chinese words. Importance sampling was applied to optimize the LDA (latent Dirichlet allocation) model, yielding a new training model, ISLDA (importance sampling latent Dirichlet allocation), which extracts topic words; cosine similarity was then used to calculate the repetition rate. The two models were compared on topic word categories and on the probability distribution of each word. The results show that the new training model improves the topic model, raising both the training accuracy of the model and its ability to detect duplication in test texts.
Keywords: semantic analysis; duplicate-checking model; importance sampling; text vectorization; similarity calculation
Classification number: TP391.1 [Automation and Computer Technology - Computer Application Technology]
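
The abstract mentions two steps that are easy to illustrate in isolation: filtering out stop words and non-Chinese words, and scoring two texts by cosine similarity. The following is a minimal sketch of those two steps only, not the authors' ISLDA model. It assumes the texts are already segmented into tokens (for example by Jieba), compares raw term counts rather than topic or Word2vec vectors, and uses a hypothetical stop-word list; the names `STOP_WORDS`, `clean_tokens`, and `cosine_similarity` are illustrative and do not come from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the paper's list is not given.
STOP_WORDS = {"的", "了", "和", "在"}

# Matches tokens made up entirely of CJK Unified Ideographs.
CHINESE_ONLY = re.compile(r"^[\u4e00-\u9fff]+$")


def clean_tokens(tokens):
    """Keep tokens that are purely Chinese and are not stop words."""
    return [t for t in tokens if CHINESE_ONLY.match(t) and t not in STOP_WORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no meaningful similarity
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Pre-segmented example reports (assumed output of a word segmenter).
    report_a = clean_tokens(["实习", "的", "报告", "Python", "内容"])
    report_b = clean_tokens(["实习", "报告", "和", "总结"])
    print(f"similarity: {cosine_similarity(report_a, report_b):.3f}")
```

In a pipeline closer to the one described, the count vectors would presumably be replaced by the topic-word distributions produced by the trained model, with the similarity computation itself unchanged.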