Fusion of Clustering Trigger-Pair Features for POS Tagging Based on Maximum Entropy Model  Cited by: 20


Authors: Zhao Yan (赵岩)[1], Wang Xiaolong (王晓龙)[1], Liu Bingquan (刘秉权)[1], Guan Yi (关毅)[1]

Affiliation: [1] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2006, No. 2, pp. 268-274 (7 pages)

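Funding: National Natural Science Foundation of China (60175020); National High Technology Research and Development Program of China ("863" Program) (2002AA117010-09)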

Abstract: Part-of-speech (POS) information is needed before more complex analysis can be carried out. Traditional POS taggers are based on the hidden Markov model (HMM), but an HMM cannot include the long-distance lexical features that help predict the right POS tag. To solve this problem, trigger pairs of the form "WA→WB/TB" are proposed to carry long-distance lexical information, and average mutual information (AMI), rather than mutual information (MI), is used as the correlation measure for selecting trigger pairs from the training corpus. To cope with the data sparseness of the trigger word "WA", words are clustered using the semantic similarity calculation provided by the vector space model, and the clustering results are fused with a semantic dictionary to build clustering trigger pairs. Finally, the selected clustering trigger pairs are added to the POS tagging system as a new kind of feature under the maximum entropy framework. Experiments show that, compared with the HMM, the maximum entropy model with clustering trigger-pair features reduces the tagging error rate by 34%. The idea can also be applied to Pinyin-to-character conversion and word sense disambiguation.
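The trigger-pair selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of average mutual information between two binary events, A (the trigger word appears in the history) and B (the triggered word/tag occurs), estimated from co-occurrence counts. The function name, its parameters, and the counts used below are hypothetical; the record itself does not give the paper's exact formula or data.

```python
import math


def average_mutual_information(n_ab, n_a, n_b, n_total):
    """Estimate AMI (in bits) between binary events A and B from counts.

    n_ab    : number of contexts where both A and B occur
    n_a     : number of contexts where A occurs
    n_b     : number of contexts where B occurs
    n_total : total number of contexts
    All names are illustrative assumptions, not taken from the paper.
    """
    # Each entry is (joint count, marginal count for the A side, marginal count for the B side)
    # for one of the four cells: (A, B), (A, not B), (not A, B), (not A, not B).
    cells = [
        (n_ab, n_a, n_b),
        (n_a - n_ab, n_a, n_total - n_b),
        (n_b - n_ab, n_total - n_a, n_b),
        (n_total - n_a - n_b + n_ab, n_total - n_a, n_total - n_b),
    ]
    ami = 0.0
    for joint, marg_a, marg_b in cells:
        if joint > 0:  # a cell with zero count contributes nothing
            p_joint = joint / n_total
            # p(a, b) / (p(a) * p(b)) written in terms of counts
            ratio = joint * n_total / (marg_a * marg_b)
            ami += p_joint * math.log2(ratio)
    return ami


# Hypothetical counts: candidate pairs would be ranked by this score
# and only the highest-scoring ones kept as features.
score = average_mutual_information(n_ab=40, n_a=50, n_b=50, n_total=100)
```

Unlike pointwise mutual information, this measure also accounts for the three cells in which A or B is absent, so a pair that co-occurs only a handful of times cannot receive a high score on the strength of a single large ratio.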

Keywords: POS tagging; maximum entropy model; vector space model; semantic similarity calculation; trigger pair

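Classification No.: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]; TP391.2 [Automation and Computer Technology: Control Science and Engineering]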

 
