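Author: Liu Jian (刘剑) [1,2]
Affiliations: [1] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [2] PLA University of Foreign Languages, Luoyang, Henan 471003
Source: Journal of Terahertz Science and Electronic Information Technology (《太赫兹科学与电子信息学报》), 2014, No. 6, pp. 870-873, 878 (5 pages)
Funding: National 973 Program of China (2012CB316303); National Natural Science Foundation of China (60933005)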
Abstract: The primary task in constructing a domain ontology is to obtain the relevant domain concepts. Many of these concepts are expressed as domain compound words that are not included in common dictionaries, so extracting domain compound words is essential for ontology construction. Based on linguistic rules and statistical techniques, this paper proposes a hybrid extraction method that combines improved mutual information with language templates. First, an improved mutual information algorithm extracts high-frequency candidate domain compound words formed from multi-word units. On this basis, language template matching extracts low-frequency candidate domain compound words. Finally, experts review the candidates to produce the set of domain compound words. Experimental results show that the algorithm achieves a precision of 88.22% in domain compound word extraction, which indicates that it is suitable for automatically and efficiently extracting domain compound words from large-scale web text.
Classification number: TP391.1 [Automation and Computer Technology / Computer Application Technology]
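
The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify how mutual information was improved or what the language templates look like, so the code below substitutes standard pointwise mutual information (PMI) over adjacent word pairs, and a hypothetical "same head word" template for the low-frequency pass. The function names, thresholds, and toy corpus are all assumptions made for illustration, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def pmi_candidates(sentences, min_count=2, min_pmi=1.0):
    """High-frequency pass: adjacent word pairs that pass a count and PMI threshold.

    Uses plain PMI, not the improved variant from the paper (which the
    abstract does not define).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    result = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs are left to the template pass
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= min_pmi:
            result[(a, b)] = pmi
    return result


def template_candidates(sentences, known_heads):
    """Low-frequency pass: a hypothetical template that keeps any adjacent
    pair whose second word is the head of an already accepted candidate."""
    found = set()
    for tokens in sentences:
        for a, b in zip(tokens, tokens[1:]):
            if b in known_heads:
                found.add((a, b))
    return found


# Toy, pre-segmented corpus (invented for this example).
sentences = [
    ["domain", "ontology", "construction"],
    ["domain", "ontology", "learning"],
    ["the", "domain", "ontology"],
    ["medical", "ontology", "is", "useful"],
]

high = pmi_candidates(sentences)
heads = {b for (_, b) in high}
low = template_candidates(sentences, heads) - set(high)

print("high-frequency candidates:", sorted(high))
print("low-frequency candidates:", sorted(low))
```

On this toy corpus the statistical pass accepts only the repeated pair, and the template pass then recovers a pair that occurs once but shares its head word. In the method described above, both candidate sets would still go to experts for review before being accepted.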