Automatic Indexing of Large Scale Subject Words (大规模主题词自动标引方法)

Cited by: 5

Authors: Han Hongqi [1,2]; Gui Jie; Zhang Yunliang [1,2]; Weng Mengjuan [1,2]; Xue Shan; Yue Lindong

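Affiliations: [1] Institute of Scientific and Technical Information of China, Beijing 100038; [2] Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, National Press and Publication Administration, Beijing 100038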

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2022, No. 5, pp. 475-485 (11 pages)

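Funding: General Project of the Innovation Research Fund of the Institute of Scientific and Technical Information of China, "Research on Interdisciplinary Collaboration Networks Based on the Subject Classification of Papers" (MS2022-04); China Knowledge Centre for Engineering Sciences and Technology construction project, "Construction of the Knowledge Organization System" (CKCEST-2022-1-29).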

Abstract: Existing subject indexing methods can generally extract only words that appear in the text; they cannot select, from a vocabulary of tens or hundreds of thousands of subject words, terms that are strongly related semantically but do not appear in the text. Multi-label classification algorithms based on machine learning require training data under every label, which limits their application to subject indexing. To meet the need for indexing massive document collections against a large-scale subject vocabulary, this study proposes a hybrid automatic indexing method based on distributed word vectors. Word vectors trained on a large-scale corpus are used to generate subject word representation vectors and text representation vectors of the same dimension, so that the semantic similarity between a subject word and a text can be computed. A mapping table between subject words and common words is constructed from a large-scale corpus, so that each text vector is compared only with a small number of semantically strongly related subject word vectors, which significantly reduces the amount of computation and improves indexing efficiency. The resulting automatic indexing tool has been used to index nearly 100 million documents and achieved a relatively high speed. In an experimental comparison with Jieba keywords, the subject words extracted by the proposed method overlap less with author keywords, and, once non-subject words are removed from the Jieba keywords, the proposed method achieves higher indexing accuracy than Jieba. In an experimental comparison with manual indexing, the performance of the proposed method and the consistency of its results with the manual results increase steadily as the number of manually assigned index terms grows.
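The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors, the vocabulary, the mapping table, the use of averaged word vectors for both texts and subject words, and the function names (`mean_vector`, `cosine`, `index_text`) are all assumptions made for illustration. The abstract states only that subject words and texts are represented in the same vector space and that a word-to-subject-word mapping table limits the comparisons.

```python
import math

# Toy word vectors (hypothetical values); in the paper these are trained on a large corpus.
WORD_VECTORS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "training": [0.7, 0.3, 0.0],
    "rice": [0.0, 0.1, 0.9],
    "yield": [0.1, 0.2, 0.8],
}

# Hypothetical subject words, each described by a few related common words.
SUBJECT_TERMS = {
    "deep learning": ["neural", "network", "training"],
    "crop science": ["rice", "yield"],
}

# Hypothetical mapping table: common word -> semantically related subject words.
MAPPING = {
    "neural": ["deep learning"],
    "network": ["deep learning"],
    "rice": ["crop science"],
    "yield": ["crop science"],
}


def mean_vector(words, vectors):
    """Average the vectors of the known words; return None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Assumption: a subject word vector is the mean of its related word vectors,
# which places it in the same space and dimension as the text vectors.
SUBJECT_VECTORS = {
    term: mean_vector(words, WORD_VECTORS) for term, words in SUBJECT_TERMS.items()
}


def index_text(tokens, top_k=3):
    """Return up to top_k (subject word, similarity) pairs for a tokenized text."""
    text_vec = mean_vector(tokens, WORD_VECTORS)
    if text_vec is None:
        return []
    # Use the mapping table so only a few candidate subject words are scored,
    # instead of comparing against the whole subject vocabulary.
    candidates = set()
    for tok in tokens:
        candidates.update(MAPPING.get(tok, []))
    scored = [(term, cosine(text_vec, SUBJECT_VECTORS[term])) for term in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]
```

Calling `index_text(["rice", "yield", "neural"])` scores only the two subject words reachable through the mapping table and ranks "crop science" first. With a real vocabulary of hundreds of thousands of subject words, this candidate restriction is where the computational saving described in the abstract would come from.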

Keywords: subject indexing; distributed word vectors; multi-label text classification; keyword extraction; semantic labels

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]
