A Multi-Strategy-Based Phrase Recognition Method (融合多策略的短语识别方法)  Cited by: 1


Authors: HU Xiao-rong (胡小荣)[1]; YAO Chang-qing (姚长青)[1]; GAO Ying-fan (高影繁)[1]

Affiliation: [1] Institute of Scientific and Technical Information of China, Beijing 100038, China

Source: Information Science (《情报科学》), 2019, No. 6, pp. 49-54 (6 pages)

Abstract: 【Purpose/significance】To address the noise problem in phrase recognition methods based on statistical features, this paper proposes a multi-strategy phrase recognition method. 【Method/process】The method first fuses multiple statistics to extract candidate phrases and performs preliminary filtering with a stop-word list. It then uses the strong semantic expressiveness of word vectors to filter the candidate phrases further, improving recognition accuracy. Experiments are carried out on patent texts from the environmental protection field, with two word vector libraries trained on the Sogou news corpus and on Chinese patent data to optimize phrase recognition. 【Result/conclusion】The experimental results show that the method, which incorporates deep learning, improves the accuracy of phrase recognition. Filtering of results for small corpora and low thresholds remains to be studied further.
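The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract names only the ingredients (multiple statistics, a stop-word list, word vectors) and the keywords mention mutual information and adjacency entropy, so everything concrete here is an assumption: candidates are limited to two-word phrases, the statistics are pointwise mutual information and left/right adjacency entropy, and the word-vector filter keeps a candidate when the cosine similarity of its two component words reaches a threshold. The function names, thresholds, toy English tokens, and toy vectors are hypothetical; the paper works on segmented Chinese patent text with trained Word2Vec vectors, and its actual filtering criterion is not given in the abstract.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (bits) of a neighbour-frequency distribution."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def extract_candidates(tokens, stop_words, min_pmi, min_entropy):
    """Keep bigrams with high PMI and high left/right adjacency entropy."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(tokens) - 1):
        bg = (tokens[i], tokens[i + 1])
        if i > 0:
            left[bg][tokens[i - 1]] += 1       # word preceding the bigram
        if i + 2 < len(tokens):
            right[bg][tokens[i + 2]] += 1      # word following the bigram
    n = len(tokens)
    kept = []
    for (w1, w2), count in bigrams.items():
        if w1 in stop_words or w2 in stop_words:
            continue                           # preliminary stop-word filter
        p_xy = count / (n - 1)
        pmi = math.log2(p_xy / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi < min_pmi:
            continue
        if min(entropy(left[(w1, w2)]), entropy(right[(w1, w2)])) < min_entropy:
            continue
        kept.append((w1, w2))
    return kept


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vector_filter(candidates, vectors, min_sim):
    """Assumed criterion: keep a phrase if its two words are similar enough."""
    return [
        (w1, w2)
        for w1, w2 in candidates
        if w1 in vectors and w2 in vectors
        and cosine(vectors[w1], vectors[w2]) >= min_sim
    ]


# Toy data standing in for a segmented corpus and a trained vector table.
TOKENS = ("treat waste water in the plant filter waste water by the "
          "membrane reuse waste water for the plant").split()
STOP_WORDS = {"the", "in", "by", "for"}
VECTORS = {"waste": [1.0, 0.0], "water": [0.8, 0.6]}

candidates = extract_candidates(TOKENS, STOP_WORDS, min_pmi=1.0, min_entropy=1.0)
phrases = vector_filter(candidates, VECTORS, min_sim=0.5)
print(phrases)
```

In this toy run only "waste water" recurs with varied neighbours on both sides, so it is the only bigram that passes the statistical thresholds. In practice the vector table would come from a trained model rather than a hand-written dictionary, and the thresholds would be tuned on the target corpus.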

Keywords: phrase recognition; word vector; Word2Vec; mutual information; adjacency entropy

Classification code: G254.9 [Cultural Science - Library Science]

 
