Research on Agricultural Text Automatic Indexing Based on the Fusion of Word Vectors and Multi-features


Authors: XIANG Hui-min; BAI Tao [1,2]; LI Dong-ya; MA Nan

Affiliations: [1] College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China; [2] Xinjiang Agricultural Information Engineering Technology Research Center, Urumqi 830052, China

Source: Journal of Xinjiang Agricultural University, 2022, No. 6, pp. 486-492 (7 pages)

Funding: Key Research and Development Project of Xinjiang Uygur Autonomous Region (2017B01006-1).

Abstract: The TF-IDF algorithm ignores how keywords are distributed within a text and is sensitive to unbalanced data sets. To address both problems, a multi-feature fusion term frequency-inverse document and inverse word frequency (TF-IDIWF) automatic indexing algorithm is proposed and compared against the baseline algorithms TF-IDF, TF-IWF, TextRank, LSI, and LDA. A Python crawler was used to collect 200,000 agricultural texts, stored in CSV format, for training an agricultural word vector model. From four categories (policies and regulations, news and information, market, and science and technology), 1,000 articles each were randomly selected and annotated independently by multiple annotators with 5 to 13 keywords per article; the annotations were consolidated to produce the balanced agricultural text data set AGRI2020. To test whether TF-IDIWF reduces the impact of unbalanced data, an unbalanced data set was built from AGRI2020 by randomly drawing 1,000 news and information articles and 100 articles from each of the other three categories. The algorithm first combines TF-IDF with word vectors to filter and screen the segmented words and merge synonyms, and then fuses feature weights for word position, part of speech, and word span with inverse document frequency and inverse word frequency to index agricultural texts with keywords automatically. On the unbalanced data set, the F1 value was 57.08%, which was 9.12% and 1.24% higher than those of TF-IDF and TF-IWF, respectively. On the balanced data set, the average F1 value was 60.80%, which was 10.48%, 10.04%, 18.83%, and 14.89% higher than those of TF-IDF, TextRank, LSI, and LDA, respectively. The results show that the multi-feature fusion TF-IDIWF algorithm can effectively improve the accuracy of agricultural text indexing.
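The abstract does not give the TF-IDIWF formula, so the Python sketch below is only an illustration of the general idea, not the authors' method. It assumes the score is the product of term frequency, a smoothed inverse document frequency, an inverse word frequency, and an optional per-word feature weight standing in for the position, part-of-speech, and span features. The function names `tf_idiwf_scores` and `keyword_f1`, the smoothing choice, and the multiplicative combination are all hypothetical. The second function shows the usual set-based F1 for comparing predicted keywords with annotated ones.

```python
import math
from collections import Counter


def tf_idiwf_scores(doc, corpus, feature_weight=None):
    """Score each word of `doc` (list of tokens) against `corpus` (list of token lists).

    Assumed combination (not taken from the paper):
        score = TF * IDF * IWF * feature_weight(word)
    """
    n_docs = len(corpus)
    doc_freq = Counter()     # number of documents that contain each word
    corpus_freq = Counter()  # total occurrences of each word in the corpus
    for tokens in corpus:
        corpus_freq.update(tokens)
        doc_freq.update(set(tokens))
    total_words = sum(corpus_freq.values())

    scores = {}
    for word, count in Counter(doc).items():
        tf = count / len(doc)
        # Smoothed IDF so that a word found in every document still scores above zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        # IWF: log of total corpus word count over this word's corpus count.
        iwf = math.log(total_words / max(corpus_freq[word], 1))
        # Hypothetical stand-in for position / part-of-speech / span weights.
        weight = feature_weight(word) if feature_weight else 1.0
        scores[word] = tf * idf * iwf * weight
    return scores


def keyword_f1(predicted, gold):
    """Set-based F1 between predicted keywords and annotated keywords."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy corpus of already-segmented texts (illustrative data only).
corpus = [
    ["wheat", "price", "market"],
    ["wheat", "policy"],
    ["wheat", "pest", "pest"],
]
scores = tf_idiwf_scores(corpus[2], corpus)
top_keywords = sorted(scores, key=scores.get, reverse=True)[:1]
```

In this toy corpus, "wheat" appears in every document and so receives a low inverse document frequency, while "pest" is concentrated in one document and ranks first. Passing a `feature_weight` callable, for example one that boosts words found in the title, shifts the ranking toward the boosted words.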

Keywords: word vector; multi-feature fusion; TF-IDIWF; automatic indexing; agricultural text

Classification No.: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
