Authors: Han Hongqi [1,2]; Gui Jie; Zhang Yunliang [1,2]; Weng Mengjuan [1,2]; Xue Shan; Yue Lindong
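Affiliations: [1] Institute of Scientific and Technical Information of China, Beijing 100038; [2] Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content (National Press and Publication Administration), Beijing 100038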
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2022, No. 5, pp. 475-485 (11 pages)
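Funding: General Program of the Innovation Research Fund of the Institute of Scientific and Technical Information of China, "Research on Interdisciplinary Collaboration Networks Based on the Discipline Classification of Papers" (MS2022-04); China Knowledge Centre for Engineering Sciences and Technology construction project, "Construction of Knowledge Organization Systems" (CKCEST-2022-1-29).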
Abstract: Existing subject indexing methods can generally extract only words that appear in the text; they cannot select semantically related terms that do not appear in it from a vocabulary of tens or hundreds of thousands of subject terms. Machine-learning-based multi-label classification algorithms require training data for every label, which limits their use in subject indexing. To meet the need for indexing massive document collections against a large-scale subject vocabulary, this study proposes a hybrid automatic indexing method based on distributed word vectors. Word vectors trained on a large-scale corpus are used to generate subject-term and text representation vectors of the same dimension, so that the semantic similarity between a subject term and a text can be computed. A mapping table between subject terms and common words is built from a large-scale corpus, so that each text vector is compared only with a small number of semantically strongly related subject-term vectors, which greatly reduces computation and improves indexing efficiency. The resulting automatic indexing tool has been used to index nearly 100 million documents and achieved relatively high speed. In a comparison with Jieba keywords, the subject terms extracted by the proposed method overlapped less with author keywords, and, once non-subject words were removed from the Jieba keywords, the method achieved higher indexing accuracy than Jieba. In a comparison with manual indexing, the method's performance and its agreement with the manual results increased steadily as the number of manually assigned index terms grew.
Keywords: subject indexing; distributed word vectors; multi-label text classification; keyword extraction; semantic tags
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]
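The pipeline outlined in the abstract (same-dimension vectors for texts and subject terms, a word-to-subject mapping table that narrows the candidates, then similarity ranking) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy vectors, the mapping table, the function names, the use of a simple average to form the text vector, and the choice of cosine similarity are all assumptions made for the example, since the abstract does not specify them.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, illustration only).
WORD_VECTORS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "training": [0.7, 0.3, 0.0],
    "library": [0.0, 0.9, 0.2],
    "catalog": [0.1, 0.8, 0.3],
}

# Hypothetical subject-term vectors living in the same vector space.
SUBJECT_VECTORS = {
    "deep learning": [0.85, 0.15, 0.05],
    "library science": [0.05, 0.9, 0.25],
    "organic chemistry": [0.0, 0.1, 0.95],
}

# Hypothetical mapping table: common word -> strongly related subject terms.
WORD_TO_SUBJECTS = {
    "neural": ["deep learning"],
    "network": ["deep learning"],
    "training": ["deep learning"],
    "library": ["library science"],
    "catalog": ["library science"],
}


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors (assumed text representation)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def index_text(tokens, top_k=2):
    """Return up to top_k (subject term, similarity) pairs for a tokenized text."""
    known = [t for t in tokens if t in WORD_VECTORS]
    if not known:
        return []
    text_vec = mean_vector([WORD_VECTORS[t] for t in known])
    # Only subject terms reachable through the mapping table are scored,
    # so the text is never compared against the whole vocabulary.
    candidates = {s for t in known for s in WORD_TO_SUBJECTS.get(t, [])}
    scored = [(s, cosine(text_vec, SUBJECT_VECTORS[s])) for s in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


result = index_text(["neural", "network", "training", "catalog"])
for term, score in result:
    print(f"{term}: {score:.3f}")
```

In this toy run "organic chemistry" is never scored, because no token maps to it; restricting the comparison to mapped candidates is the step the abstract credits with reducing computation.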