Authors: 崔彤彤 (CUI Tongtong); 崔荣一 (CUI Rongyi) [1]
Affiliation: [1] Intelligent Information Processing Lab, Department of Computer Science and Technology, Yanbian University, Yanji, Jilin 133000, China
Source: 《中文信息学报》 (Journal of Chinese Information Processing), 2018, No. 5, pp. 74-79 (6 pages)
Funding: State Language Commission "Twelfth Five-Year" Research Plan Project (YB125-178); Jilin Province Science and Technology Development Plan Project (20140101186JC)
Abstract: The arrival of the networked big-data era has enriched the information resources available in cyberspace. However, the diversity and rapid growth of data put pressure on storage and on the effective use of those resources. This paper proposes a text fingerprint extraction method based on latent semantic analysis. The method produces a compressed representation of the data and addresses the lack of semantics in existing fingerprint extraction methods. It first obtains the latent semantic features of the original documents through singular value decomposition, then maps the original document vector space into the corresponding latent semantic space. Next, following the random hyperplane principle, it converts each document in that space into a binary digital fingerprint. Finally, the Hamming distance measures the difference between fingerprints. The method was evaluated in similarity and clustering experiments on academic papers from CNKI. The results show that the method represents document semantics reasonably well, which supports the accuracy and effectiveness of the compressed semantic representation.
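Keywords: text fingerprint; singular value decomposition; latent semantic analysis; random hyperplane principle
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]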
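The pipeline outlined in the abstract (SVD, projection into a latent semantic space, random-hyperplane binarization, Hamming distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy term-document counts, the latent dimension `k`, the fingerprint length `n_bits`, the Gaussian hyperplanes, and all function names are assumptions chosen only for illustration, since the record does not state the paper's actual settings.

```python
import numpy as np

def lsa_project(term_doc, k):
    """Map documents (columns of term_doc) into a k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # U_k^T A equals diag(s_k) V_k^T; transpose so each row is one document.
    return (U[:, :k].T @ term_doc).T

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals (assumed Gaussian) in dim dimensions."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))

def fingerprint(vectors, planes):
    """Bit i is 1 when a vector lies on the non-negative side of hyperplane i."""
    return (np.atleast_2d(vectors) @ planes.T >= 0).astype(np.uint8)

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return int(np.count_nonzero(a != b))

# Hypothetical toy corpus: 6 terms (rows) x 4 documents (columns), raw counts.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 1, 1, 1],
], dtype=float)

k, n_bits = 2, 16                      # assumed settings, not from the paper
docs = lsa_project(A, k)               # shape (4, k)
planes = make_hyperplanes(k, n_bits)   # shape (n_bits, k)
fps = fingerprint(docs, planes)        # shape (4, n_bits), entries 0 or 1

# Print the pairwise Hamming distances between document fingerprints.
for i in range(len(fps)):
    print([hamming(fps[i], fps[j]) for j in range(len(fps))])
```

Because each bit depends only on the sign of a projection, a fingerprint is unchanged when a document vector is rescaled by a positive factor, so the Hamming distance here reflects the angle between documents in the latent space rather than their length.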