基于多知识源的中文词法分析系统 (A Chinese Lexical Analysis System Based on Multiple Knowledge Sources)  Cited by: 29

Research on Chinese Lexical Analysis System by Fusing Multiple Knowledge Sources

Source: Chinese Journal of Computers (《计算机学报》), 2007, No. 1, pp. 137-145 (9 pages)

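Funding: Supported by the Key Program of the National Natural Science Foundation of China, "Theory and Methods of Question-Answering Information Retrieval" (60435020), and by the National Natural Science Foundation of China (60504021).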

Abstract: Chinese lexical analysis is the foundational task for most Chinese natural language processing. This paper examines the problems faced by word segmentation, part-of-speech (POS) tagging, and named entity recognition, as well as the way the three tasks cooperate, and presents a practical Chinese lexical analysis system built on mixed language models. The system combines several models, including n-gram, hidden Markov model, maximum entropy model, support vector machine, and conditional random fields, each applied to the sub-task it handles well. The word segmenter took part in the Second International Chinese Word Segmentation Bakeoff in 2005 and achieved F-measures of 97.2% and 96.7% in the MSR (Microsoft Research Asia) and PKU (Peking University) open tests, respectively. In an open evaluation on the People's Daily corpus annotated by Peking University (six months of text), the POS tagging module achieved 96.1% precision and the named entity recognition module achieved an F-measure of 88.6%.

Keywords: lexical analysis; Chinese word segmentation; POS tagging; named entity recognition; language model

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
