Study on multi-category of common words based on statistics and rules
(基于统计和规则的常用词的兼类识别研究)

Cited by: 4


Authors: Xia Jing (夏静)[1], Chai Yumei (柴玉梅)[1], Zan Hongying (昝红英)[1]

Affiliation: [1] School of Information Engineering, Zhengzhou University, Zhengzhou, Henan 450001, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2013, No. 2, pp. 654-659 (6 pages)

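Funding: National Natural Science Foundation of China (60970083); Open Project Fund of the National Laboratory of Pattern Recognition; Henan Province Science and Technology Innovation Talents Outstanding Youth Fund (104100510026)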

Abstract: Multi-category words, that is, words that can take more than one part of speech, are one of the key problems in Chinese part-of-speech tagging. This paper studies multi-category recognition for common words, taking into account the different features that affect it. Three statistical methods are applied: conditional random fields, maximum entropy, and k-nearest neighbor. Based on the characteristics of the multi-category words themselves and their relations to the surrounding sentence context, a different feature template (word information, part-of-speech information, and so on) is used with each method to extract features from the training corpus, and good experimental results are obtained. For words whose recognition results remain unsatisfactory, a rule-based method is also tried: rules are constructed for the multi-category words, tested repeatedly, and the rule base is refined. Under the same conditions, the rule-based method gives better results than the statistical methods.
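The abstract names its ingredients (context feature templates over words and part-of-speech tags, a k-nearest-neighbor classifier, and hand-written rules) but gives no implementation details. The following is only a minimal sketch of how such pieces could fit together, not the authors' method. Everything in it is an assumption made for illustration: the function names, the window size, the similarity measure (number of shared features), the similarity-weighted vote, the PKU-style tag letters, the toy training examples for the word 研究 (noun or verb), the single rule, and the choice to try rules before falling back to kNN.

```python
from collections import Counter

def window_features(words, tags, i, size=2):
    """Hypothetical feature template: words and POS tags of the neighbours
    within +/-size positions of the ambiguous word at index i."""
    feats = set()
    for off in range(-size, size + 1):
        if off == 0:
            continue  # the target word's own tag is what we want to predict
        j = i + off
        inside = 0 <= j < len(words)
        feats.add(f"w[{off}]={words[j] if inside else '<PAD>'}")
        feats.add(f"t[{off}]={tags[j] if inside else '<PAD>'}")
    return feats

def knn_predict(train, feats, k=3):
    """kNN over (feature_set, label) pairs. Similarity is the number of
    shared features; each of the k nearest neighbours votes with that weight."""
    ranked = sorted(train, key=lambda ex: len(ex[0] & feats), reverse=True)
    votes = Counter()
    for ex_feats, label in ranked[:k]:
        votes[label] += len(ex_feats & feats)
    return votes.most_common(1)[0][0]

# Illustrative rule base (not from the paper): after a classifier (q), read as noun.
RULES = [
    (lambda words, tags, i: i > 0 and tags[i - 1] == "q", "n"),
]

def tag_with_rules(words, tags, i, train, k=3):
    """Try the rules first; fall back to kNN when no rule fires."""
    for condition, label in RULES:
        if condition(words, tags, i):
            return label
    return knn_predict(train, window_features(words, tags, i), k)

# Toy training data for 研究; "?" marks the position to be classified.
_RAW = [
    (["我们", "研究", "问题"], ["r", "?", "n"], 1, "v"),
    (["他们", "研究", "语言"], ["r", "?", "n"], 1, "v"),
    (["这", "项", "研究", "很", "重要"], ["r", "q", "?", "d", "a"], 2, "n"),
    (["该", "研究", "表明"], ["r", "?", "v"], 1, "n"),
]
TRAIN = [(window_features(w, t, i), label) for w, t, i, label in _RAW]

if __name__ == "__main__":
    print(knn_predict(TRAIN, window_features(["我", "研究", "语法"], ["r", "?", "n"], 1)))
    print(tag_with_rules(["这", "项", "研究", "很", "有趣"], ["r", "q", "?", "d", "a"], 2, TRAIN))
```

The sketch assumes the context tags are already available, for example from a first-pass tagger. A real system would also need a much larger annotated corpus and a per-word rule base refined through repeated testing, as the abstract describes.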

Keywords: Chinese information processing; multi-category words; conditional random fields; maximum entropy; k-nearest neighbor

CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
