PATENT TERM AUTO-EXTRACTION BASED ON MULTI-STRATEGY INTEGRATION  (Cited by: 4)

Authors: Zhou Shaojun (周绍钧), Lü Xueqiang (吕学强)[1], Li Zhuo (李卓)[1], Du Yuncheng (都云程)[1]

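Affiliation: [1] Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China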

Source: Computer Applications and Software (《计算机应用与软件》), 2015, No. 2, pp. 28-32 (5 pages)

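Funding: National Natural Science Foundation of China (61171159; 61271304); Key Project of the Science and Technology Development Program of the Beijing Municipal Education Commission and Class B Key Project of the Beijing Natural Science Foundation (KZ201311232037)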

Abstract: Automatic patent term extraction is a key step in knowledge extraction and text mining. Candidate patent terms are first extracted using a stop-word list built for patent documents together with a set of extraction rules. The paper then analyses three factors: the association between a patent term and the sentences in which it occurs, the influence of adjacent patent terms on one another, and the interference of common-sense (general) words with term extraction. For these it proposes, respectively, STRank, a weight calculation method based on the PageRank idea; a method for computing the distinction degree of patent terms; and a weight-reduction method that uses HowNet sememe information. The three methods are then integrated to extract patent terms. In experiments on patent documents from the sensor field, precision is 80.5% at the top-1400 level and 79.7% at the top-1600 level, which is 11.4% and 9.5% higher, respectively, than the CS+CC+CD method. The experimental results demonstrate the effectiveness of the multi-strategy integration method.
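The abstract names a PageRank-style weighting that links terms to the sentences containing them, but it does not give the STRank formula. The following is only a minimal sketch of that general idea, assuming a bipartite term-sentence graph with mutual score propagation. The function name `rank_terms`, the damping factor of 0.85, the fixed iteration count, and the plain substring matching are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def rank_terms(sentences, candidates, damping=0.85, iterations=50):
    """Score candidate terms by PageRank-style propagation on a term-sentence graph.

    Hypothetical illustration only; this is NOT the paper's STRank formula.
    Returns a dict mapping every candidate that occurs at least once to a score.
    """
    # Build bipartite links: term <-> index of each sentence that contains it.
    # Substring matching is a simplifying assumption (no tokenisation).
    term_to_sents = defaultdict(set)
    sent_to_terms = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for term in candidates:
            if term in sentence:
                term_to_sents[term].add(i)
                sent_to_terms[i].add(term)

    if not term_to_sents:
        return {}

    n_terms = len(term_to_sents)
    n_sents = len(sent_to_terms)
    term_score = {t: 1.0 / n_terms for t in term_to_sents}

    for _ in range(iterations):
        # A sentence collects score from its terms; each term splits its
        # score evenly over the sentences it appears in.
        sent_score = {
            i: (1 - damping) / n_sents
            + damping * sum(term_score[t] / len(term_to_sents[t]) for t in ts)
            for i, ts in sent_to_terms.items()
        }
        # A term collects score from its sentences; each sentence splits its
        # score evenly over the terms it contains.
        term_score = {
            t: (1 - damping) / n_terms
            + damping * sum(sent_score[i] / len(sent_to_terms[i]) for i in ss)
            for t, ss in term_to_sents.items()
        }
    return term_score


if __name__ == "__main__":
    docs = [
        "the pressure sensor outputs a voltage",
        "the pressure sensor has a diaphragm",
        "the pressure sensor is calibrated",
    ]
    scores = rank_terms(docs, ["pressure sensor", "diaphragm", "thermistor"])
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(term, round(score, 4))
```

In this sketch a term that occurs in many sentences accumulates more weight, and candidates that never occur are left out of the result. The distinction-degree and HowNet sememe down-weighting steps described in the abstract are not modelled here.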

Keywords: patent terms; term extraction; PageRank; term distinction degree; sememe information

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
