Affiliations: [1] Information Engineering University; [2] Unit 71239
Source: Journal of Information Engineering University (《信息工程大学学报》), 2016, No. 5, pp. 548-554 (7 pages)
Funding: National Social Science Fund of China (14BXW028)
Abstract: Occupation is a representative feature of a person entity and can effectively distinguish one person from another. Traditional name disambiguation algorithms treat occupation as an ordinary feature and overlook its importance. To address this, the paper proposes a name disambiguation algorithm based on occupation features. First, a basic occupation dictionary is built manually from the Internet. Second, with all Chinese Wikipedia pages as the training corpus, the basic dictionary is expanded with the word activation force model to obtain an occupation feature dictionary. Third, occupation features are extracted from the text, and person names and titles of works are extracted as supplementary features, to compensate for texts that lack occupation features and for persons who have several occupations. Finally, agglomerative hierarchical clustering performs the disambiguation. Experiments on the CLP2010 name disambiguation training corpus show that the algorithm disambiguates person names effectively.
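Keywords: occupation feature; affinity; name disambiguation; word activation force; agglomerative hierarchical clustering
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]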
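
The final step named in the abstract, agglomerative hierarchical clustering over per-document feature sets, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the similarity measure, the linkage criterion, or the stopping rule, so Jaccard similarity, average linkage, and the `threshold` value are all assumptions here, and the toy feature sets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_similarity(c1, c2, features):
    """Average-linkage similarity: mean pairwise Jaccard across two clusters."""
    total = sum(jaccard(features[i], features[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(features, threshold=0.3):
    """Repeatedly merge the most similar pair of clusters.

    Stops when the best available similarity falls below `threshold`
    (an assumed stopping rule, not taken from the paper).
    """
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x, y in combinations(range(len(clusters)), 2):
            sim = cluster_similarity(clusters[x], clusters[y], features)
            if sim > best_sim:
                best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical feature sets for four documents that mention the same name.
docs = [
    {"actor", "film", "director"},
    {"actor", "film"},
    {"professor", "university", "physics"},
    {"professor", "physics"},
]
result = agglomerative_cluster(docs, threshold=0.3)
```

Each resulting cluster groups the documents presumed to refer to the same person. A lower threshold merges more aggressively, while a threshold above 1.0 leaves every document in its own cluster.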