A Study on Chinese Patent Terms Extraction for Ontology Learning (面向本体学习的中文专利术语抽取研究)  Cited by: 18


Authors: Wang Hao (王昊)[1,2], Wang Miping (王密平), Su Xinning (苏新宁)[1,2]

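Affiliations: [1] School of Information Management, Nanjing University, Nanjing 210023; [2] Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing University, Nanjing 210023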

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2016, No. 6, pp. 573-585 (13 pages)

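Funding: Supported by the Natural Science Foundation of Jiangsu Province project "Research on Chinese Ontology Learning for Patent Early Warning" (BK20130587), the Major Project of the National Social Science Fund of China "Research on a Rapid-Response Intelligence System for Emergency Decision-Making" (13&ZD174), and others.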

Abstract: This paper proposes a solution for extracting Chinese patent terms when little or no training corpus is available. Taking patent texts from the iron and steel metallurgy domain as an example, it first summarizes the basic characteristics of Chinese terms in that domain, then builds a machine-learning term recognition model based on character role labeling. The conditional random field (CRF) learning process is repeated iteratively so as to minimize the inaccurate and incomplete labeling that arises when a core vocabulary is used in place of manual annotation. On this basis, new terms are constructed according to combination rules and, once confirmed by domain experts, added to the core vocabulary. Experiments show that the F1 value of basic term extraction based on character role labeling exceeds 94%, and that the accuracy of complex term extraction based on combination rules reaches 75%. From the titles and abstracts of 7597 patents, 244672 Chinese basic terms and 61536 combination terms were ultimately obtained, laying a foundation for building the domain ontology.
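
The labeling and combination steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the B/M/E/S/O role set, the forward maximum matching strategy, the adjacency-based combination rule, and the sample vocabulary and sentence are all assumptions made for illustration. The CRF training step and the iterative loop are not shown; the role sequence produced here only stands in for the kind of automatically labeled data such a model might be trained on.

```python
from typing import List, Set, Tuple


def label_characters(text: str, vocab: Set[str]) -> List[str]:
    """Assign a role to each character by forward maximum matching.

    Assumed role set: B (begin), M (middle), E (end), S (single-character
    term), O (outside any term). The paper's actual roles may differ.
    """
    max_len = max((len(w) for w in vocab), default=0)
    labels: List[str] = []
    i = 0
    while i < len(text):
        match_len = 0
        # Try the longest candidate first so that longer terms win.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                match_len = n
                break
        if match_len == 0:
            labels.append("O")
            i += 1
        elif match_len == 1:
            labels.append("S")
            i += 1
        else:
            labels.extend(["B"] + ["M"] * (match_len - 2) + ["E"])
            i += match_len
    return labels


def extract_terms(text: str, labels: List[str]) -> List[Tuple[int, str]]:
    """Read (start_index, term) pairs back out of a role sequence."""
    terms: List[Tuple[int, str]] = []
    start = None
    for i, lab in enumerate(labels):
        if lab == "S":
            terms.append((i, text[i]))
            start = None
        elif lab == "B":
            start = i
        elif lab == "E" and start is not None:
            terms.append((start, text[start:i + 1]))
            start = None
        elif lab == "O":
            start = None
    return terms


def compose_adjacent(terms: List[Tuple[int, str]]) -> List[str]:
    """Hypothetical combination rule: join two directly adjacent basic terms."""
    candidates: List[str] = []
    for (s1, t1), (s2, t2) in zip(terms, terms[1:]):
        if s1 + len(t1) == s2:
            candidates.append(t1 + t2)
    return candidates


# Hypothetical core vocabulary and sentence from the metallurgy domain.
vocab = {"高炉", "炼铁"}
text = "一种高炉炼铁方法"
labels = label_characters(text, vocab)
terms = extract_terms(text, labels)
candidates = compose_adjacent(terms)
```

In this sketch `candidates` holds combined-term candidates only; in the workflow the abstract describes, such candidates would still need confirmation by domain experts before being added to the core vocabulary.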

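Keywords: Chinese patent terms; machine learning; conditional random fields; character role labeling; iterative loop; combination rules; ontology learning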

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
