Research on Automatic Acquisition and Preprocessing Methods of Domain Ontology Learning Corpus

Cited by: 5

Authors: Wang Sili; Zhu Zhongming; Liu Wei; Yang Heng

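Affiliations: [1] Literature and Information Center, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences; [2] University of Chinese Academy of Sciences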

Source: Research on Library Science, 2019, No. 20, pp. 54-64 (11 pages)

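Funding: This work is one of the research outcomes of the 2018 Director's Fund project of the Lanzhou Literature and Information Center, Chinese Academy of Sciences, "Research on Deep Learning-Based Automatic Construction of Domain Ontology" (project no. Y8AJ012005), and of the 2019 Chinese Academy of Sciences "Light of West China" project "Research on Contextualized Organization and Services for Open Academic Resources" (project no. Y9AX011001).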

Abstract: This study implements automatic acquisition and preprocessing of domain corpora, providing the data and data-processing foundation for automatic domain ontology construction driven by machine learning or deep learning. First, it analyzes the types of corpora involved, their acquisition methods, and the state of application research, and proposes automatic acquisition methods for multi-source heterogeneous domain corpora, including Web Spider-based acquisition of open web domain corpora and Web API-based acquisition of scientific literature domain corpora. Second, it proposes a method for automatically constructing a dictionary of basic domain knowledge, which lays the foundation for corpus preprocessing. Finally, after testing and evaluating mainstream word segmentation methods and open-source word segmentation tools, it proposes a multi-strategy hybrid method for automatic word segmentation and new word discovery based on an incrementally trained HanLP-SP domain segmentation model, and carries out experiments. The method can effectively acquire domain corpora and perform preprocessing tasks such as word segmentation.

Keywords: domain corpus; ontology learning; automatic acquisition; preprocessing; word segmentation

Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]
