Authors: Tao Yue (陶玥); Yu Li (余丽) [1,3]; Zhang Runjie (张润杰)
Affiliations: [1] National Science Library, Chinese Academy of Sciences, Beijing 100190, China; [2] Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China; [3] State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China; [4] Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
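Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2020, No. 10, pp. 134-143 (10 pages)
Funding: One of the research outcomes of the Young Scientists Fund of the National Natural Science Foundation of China, "Annotation and Evaluation of Geographic Entity Semantic Relations in Chinese Web Text" (Grant No. 41801320), and of the Open Fund of the State Key Laboratory of Resources and Environmental Information System.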
Abstract: [Objective] This paper explores an effective approach to information extraction from scientific literature when annotated corpora are scarce, using active learning strategies. [Methods] We design a neural network architecture that incorporates active learning. It combines three representative active learning strategies (MARGIN, NSE, MNLP) and a newly proposed LWP strategy with a neural information extraction model (CNN-BiLSTM-CRF), in order to study task-driven information extraction methods suited to settings with little annotated data. [Results] Guided by active learning, selectively annotating only 10%~30% of the data matches the performance of the neural model trained on 100% of the annotated data, which greatly reduces the labor cost of building an annotated corpus. [Limitations] The dataset of scientific literature in the artificial intelligence field is small and noisy, and the precision of the information extraction model is low. [Conclusions] A neural network model guided by active learning strategies substantially reduces the size of the annotated corpus required. A comparison of the four strategies shows that MNLP significantly outperforms the others; MARGIN performs well in the initial iterations and can identify low-value instances; the sentence-length normalization in MNLP improves model stability; and LWP suits datasets in which semantic labels make up a large share of the tokens.
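The abstract names MNLP (maximum normalized log-probability) as the strongest of the four selection strategies and attributes its stability to sentence-length normalization. The record itself gives no formulas, so the following is only a minimal sketch of the commonly used form of MNLP, not the authors' implementation. It assumes the model exposes per-token log-probabilities for its best label sequence; with a CRF output layer, one would instead divide the log-probability of the best path by the sentence length. The names `mnlp_score`, `select_for_annotation`, and `pool` are hypothetical, and the probabilities are made-up illustration values.

```python
import math


def mnlp_score(token_log_probs):
    """Length-normalized log-probability of the model's best label sequence.

    Dividing by the sentence length keeps long sentences from being
    selected merely because they accumulate more negative log terms.
    """
    if not token_log_probs:
        raise ValueError("sentence must contain at least one token")
    return sum(token_log_probs) / len(token_log_probs)


def select_for_annotation(pool, budget):
    """Return indices of the `budget` sentences the model is least sure about.

    `pool` is a list of sentences, each given as a list of per-token
    log-probabilities (an assumed interface, for illustration only).
    """
    ranked = sorted(range(len(pool)), key=lambda i: mnlp_score(pool[i]))
    return ranked[:budget]


# Made-up per-token probabilities for four unlabeled sentences.
pool = [
    [math.log(0.9), math.log(0.8)],
    [math.log(0.5), math.log(0.4), math.log(0.6)],
    [math.log(0.99)] * 6,
    [math.log(0.3), math.log(0.9)],
]
selected = select_for_annotation(pool, budget=2)
```

In an active learning loop, the selected sentences would be sent to annotators, added to the training set, and the model retrained before the next round of scoring. The six-token sentence in `pool` is long but confidently labeled, so normalization keeps it out of the selection.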
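Classification: TP393 [Automation and Computer Technology: Computer Application Technology]; G202 [Automation and Computer Technology: Computer Science and Technology]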