A Method for Extracting Domain Compound Words

Author: Liu Jian (刘剑) [1,2]

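Affiliations: [1] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; [2] PLA University of Foreign Languages, Luoyang, Henan 471003, China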

Source: Journal of Terahertz Science and Electronic Information Technology (《太赫兹科学与电子信息学报》), 2014, No. 6, pp. 870-873, 878 (5 pages)

Funding: National 973 Program of China (2012CB316303); National Natural Science Foundation of China (60933005)

Abstract: The first task in constructing a domain ontology is to obtain domain-relevant concepts. Many of these concepts are domain compound words that are not included in common dictionaries, so extracting domain compound words is essential to ontology construction. Based on linguistic rules and statistical techniques, this paper proposes a method for extracting domain compound words that combines an improved mutual information measure with language templates. First, the improved mutual information algorithm extracts high-frequency candidate domain compound words formed from multi-word units. On this basis, language template matching extracts low-frequency candidate domain compound words. Finally, experts check the candidates to obtain the set of domain compound words. Experimental results show that the method achieves an extraction precision of 88.22% and is suitable for automatically and efficiently extracting domain compound words from large-scale web page text.
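The two automatic stages described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's improved mutual information formula or its language templates, so everything here is an assumption: the sketch substitutes standard pointwise mutual information (PMI) over adjacent word pairs, and uses a single illustrative part-of-speech template (optional adjectives followed by two or more nouns). The function names, thresholds, and one-letter tags are hypothetical, and the input is assumed to be already segmented and tagged.

```python
import math
import re
from collections import Counter


def pmi_candidates(sentences, min_count=2, min_pmi=1.0):
    """Score adjacent word pairs with standard PMI (NOT the paper's improved variant).

    `sentences` is a list of token lists. Pairs seen fewer than `min_count`
    times are skipped, mirroring the "high-frequency" stage in the abstract.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scored = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_pair / (p_a * p_b))
        if pmi >= min_pmi:
            scored[(a, b)] = pmi
    return scored


def template_candidates(tagged_sentences, pattern=r"A*N{2,}"):
    """Match a POS template against each sentence, regardless of frequency.

    `tagged_sentences` is a list of (word, tag) lists. Every tag must be
    exactly one character (hypothetical tag set: A = adjective, N = noun,
    V = verb), so regex offsets map directly onto token positions.
    """
    regex = re.compile(pattern)
    found = set()
    for tagged in tagged_sentences:
        tags = "".join(tag for _, tag in tagged)
        for match in regex.finditer(tags):
            span = tagged[match.start():match.end()]
            found.add(tuple(word for word, _ in span))
    return found


# Toy corpus (invented for illustration only).
sentences = [
    ["semantic", "web", "uses", "domain", "ontology"],
    ["domain", "ontology", "needs", "concepts"],
    ["the", "web", "needs", "domain", "ontology"],
]
tagged = [[("large", "A"), ("knowledge", "N"), ("base", "N"),
           ("helps", "V"), ("users", "N")]]

high_freq = pmi_candidates(sentences)
low_freq = template_candidates(tagged)
```

In this sketch the frequent pair ("domain", "ontology") clears the PMI threshold, while the template picks up ("large", "knowledge", "base") even though it occurs only once. The union of the two candidate sets would then go to the expert check that the abstract describes as the final step.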

Keywords: domain ontology; mutual information; language templates; domain compound words

Classification code: TP391.1 [Automation and Computer Technology - Computer Application Technology]

 
