Name Disambiguation Algorithm Based on Occupational Characteristics (基于职业特征的人名消歧算法)  Cited by: 2

Name Disambiguation Algorithm Based on Clustering Occupational Characteristics


Authors: Yang Yilin (阳怡林)[1], Zhou Jie (周杰)[1], Li Bicheng (李弼程)[1], Li Aiguo (李爱国)

Affiliations: [1] Information Engineering University; [2] Unit 71239

Source: Journal of Information Engineering University (《信息工程大学学报》), 2016, No. 5, pp. 548-554 (7 pages)

Funding: National Social Science Fund of China (14BXW028)

Abstract: Occupation is a representative feature of a person entity and can effectively distinguish one person from another. Traditional name disambiguation algorithms treat occupation as just another ordinary feature and overlook its importance. To address this, the paper proposes a name disambiguation algorithm based on occupational features. First, a basic occupation dictionary is built manually from internet sources. Second, with all Chinese Wikipedia pages as the training corpus, the basic dictionary is extended through the word activation force model to obtain an occupation feature dictionary. Third, occupational features are extracted from the text, and person names and titles of works are extracted as supplementary features to compensate for texts that lack occupational features and for persons who hold more than one occupation. Finally, agglomerative hierarchical clustering performs the disambiguation. Experiments on the CLP2010 name disambiguation training corpus show that the algorithm disambiguates person names effectively.
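The final clustering step can be illustrated with a minimal sketch. The abstract does not state the similarity measure, the linkage criterion, or the stopping rule, so the Jaccard similarity, average linkage, and the `threshold` parameter below are assumptions chosen for illustration, and the feature sets are hypothetical stand-ins for the occupation, person-name, and work-title features extracted from each text. It is not the authors' implementation.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_link(c1, c2, docs):
    """Mean pairwise similarity between the documents of two clusters."""
    total = sum(jaccard(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(docs, threshold=0.3):
    """Merge the most similar pair of clusters until no pair reaches threshold.

    The threshold is an assumed stopping rule, not one taken from the paper.
    """
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per text
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for a, b in combinations(range(len(clusters)), 2):
            sim = average_link(clusters[a], clusters[b], docs)
            if sim > best_sim:
                best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break  # remaining clusters are treated as different persons
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical feature sets for four texts that mention the same name.
docs = [
    {"professor", "university", "physics"},
    {"professor", "physics", "laboratory"},
    {"singer", "album", "concert"},
    {"singer", "album"},
]
clusters = agglomerative_cluster(docs, threshold=0.3)
```

Each resulting cluster groups the texts judged to refer to the same person; a lower `threshold` merges more aggressively, a higher one keeps more texts apart.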

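Keywords: occupational characteristics; affinity; name disambiguation; word activation force; agglomerative hierarchical clustering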

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
