Technology Entity Extraction of Patent Literature with Limited Annotated Data

Cited by: 4


Authors: YUAN Zhi'an (原之安), PENG Furong (彭甫镕), GU Bo (谷波), QIAN Yuhua (钱宇华) [1,2,3]

Affiliations: [1] Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan 030006, Shanxi, China; [2] Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, Shanxi, China; [3] School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China

Source: Journal of Zhengzhou University: Natural Science Edition (《郑州大学学报(理学版)》), 2021, No. 4, pp. 61-68 (8 pages)

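Funding: National Natural Science Foundation of China (61672332); Key Research and Development Program of Shanxi Province (201903D421003); Shanxi Provincial Department of Education Program for the Transformation and Cultivation of Scientific and Technological Achievements (2020CG001).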

Abstract: Technology entities in patents are the terms in patent documents that carry rich scientific and technological information. Extracting them accurately is crucial both for researchers seeking to improve research efficiency and for enterprises planning their patent portfolios. This paper proposes a technology entity extraction method that combines a semi-supervised learning framework with a named entity recognition model. Semi-supervised learning exploits unlabeled data to compensate for the scarcity of annotated data. In addition, the general-domain BERT model is further pre-trained on a large patent corpus to obtain BERT-Patent, a BERT model adapted to the patent domain, which effectively improves the extraction of technology entities from patents. Experiments show that, on the patent dataset, the proposed method improves accuracy, recall, and F1 score by 6.37%, 2.99%, and 4.63%, respectively; on the People's Daily dataset, the improvements are 2.87%, 1.24%, and 2.07%, respectively.
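The abstract does not say which semi-supervised scheme the authors use. As an illustration only, the sketch below shows self-training (pseudo-labeling), one common way to let unlabeled sentences supplement a small annotated set. `ToyTagger`, `self_train`, the tag names, and the confidence threshold are all hypothetical and are not taken from the paper. The toy tagger stands in for a real NER model such as a BERT-based tagger and cannot generalize to unseen tokens, so the example shows only the control flow.

```python
from collections import Counter, defaultdict


class ToyTagger:
    """Hypothetical stand-in for a NER model: per-token majority-vote tagger."""

    def fit(self, sentences):
        # sentences: list of (tokens, tags) pairs with equal-length lists.
        counts = defaultdict(Counter)
        for tokens, tags in sentences:
            for tok, tag in zip(tokens, tags):
                counts[tok][tag] += 1
        self.counts = counts
        return self

    def predict(self, tokens):
        """Return (tags, confidence); confidence is the lowest per-token vote share."""
        tags, confs = [], []
        for tok in tokens:
            votes = self.counts.get(tok)
            if not votes:
                # Unseen token: fall back to the "outside" tag with zero confidence.
                tags.append("O")
                confs.append(0.0)
            else:
                tag, n = votes.most_common(1)[0]
                tags.append(tag)
                confs.append(n / sum(votes.values()))
        confidence = min(confs) if confs else 0.0
        return tags, confidence


def self_train(labeled, unlabeled, threshold=0.9, rounds=3):
    """Self-training loop (an assumed scheme, not necessarily the paper's).

    Each round: tag the unlabeled pool, move sentences whose confidence
    reaches the threshold into the training set as pseudo-labeled data,
    then retrain. Stops early when no sentence qualifies.
    """
    train = list(labeled)
    pool = list(unlabeled)
    model = ToyTagger().fit(train)
    for _ in range(rounds):
        remaining, added = [], 0
        for tokens in pool:
            tags, conf = model.predict(tokens)
            if conf >= threshold:
                train.append((tokens, tags))
                added += 1
            else:
                remaining.append(tokens)
        if added == 0:
            break
        pool = remaining
        model = ToyTagger().fit(train)
    return model, train, pool
```

In a realistic setting the confidence would come from the tagger's output probabilities, and the threshold trades pseudo-label quantity against label noise.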

Keywords: technology entity; patent mining; data scarcity; BERT; semi-supervised learning

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
