Practical Techniques for Ambiguity Segmentation in a Modern Chinese General-Purpose Word Segmentation System (cited by: 19)

Disambiguation in a Modern Chinese General-Purpose Word Segmentation System

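Affiliations: [1] College of Computer Science, Beijing University of Technology, Beijing 100022; [2] School of Information Science, Beijing Language and Culture University, Beijing 100083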

Source: Journal of Computer Research and Development, 2006, No. 6, pp. 1122-1128 (7 pages)

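Funding: National Natural Science Foundation of China (60272055); National High Technology Research and Development Program of China ("863" Program) (2001AA114111); Key Science and Technology Research Project of the Ministry of Education (00128); Major Project of the Key Research Bases for Humanities and Social Sciences, Ministry of Education (02JAZJD740007)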

Abstract: Ambiguity resolution is one of the key techniques in automatic Chinese word segmentation. A modern Chinese general-purpose word segmentation system (GPWS) places especially high practical demands on it, because the system lets users create dictionaries dynamically and lets multiple user dictionaries take part in segmentation at the same time. Based on a large-scale corpus of real text, this paper examines the distribution and characteristics of ambiguities, overlapping ambiguities in particular; proposes an improved forward maximum matching algorithm for detecting ambiguous fragments; and, following the requirements of GPWS, proposes a practical "rules + exceptions" disambiguation strategy. Overlapping ambiguity fragments (about 2.4 million occurrences) were exhaustively extracted from a People's Daily corpus of 100 million characters (approximately 234 MB), and randomized open tests of the strategy achieved an average accuracy of 99%.
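The abstract names two components without giving their details: detection of overlapping ambiguity fragments by forward maximum matching, and a "rules + exceptions" disambiguation step. The sketch below illustrates the general idea only and is not the paper's improved algorithm. The function names, the `max_len` window, the toy lexicon, the default rule (plain forward maximum matching), and the hand-written exception table are all assumptions made for illustration.

```python
def match_end(text, lexicon, start, max_len=4):
    """Return the end index of the longest lexicon word starting at `start`.

    Falls back to a single character when no multi-character word matches.
    """
    limit = min(len(text), start + max_len)
    for end in range(limit, start + 1, -1):
        if text[start:end] in lexicon:
            return end
    return start + 1


def forward_max_match(text, lexicon, max_len=4):
    """Plain greedy forward maximum matching segmentation."""
    words, i = [], 0
    while i < len(text):
        end = match_end(text, lexicon, i, max_len)
        words.append(text[i:end])
        i = end
    return words


def overlapping_fragments(text, lexicon, max_len=4):
    """Collect maximal spans in which lexicon words chain across word boundaries."""
    fragments, i = [], 0
    while i < len(text):
        first_end = match_end(text, lexicon, i, max_len)
        frag_end = first_end
        p = i + 1
        # A word that starts inside the current span and ends beyond it
        # overlaps the span, so the fragment is extended to that word's end.
        while p < frag_end:
            frag_end = max(frag_end, match_end(text, lexicon, p, max_len))
            p += 1
        if frag_end > first_end:
            fragments.append(text[i:frag_end])
        i = frag_end
    return fragments


def disambiguate(fragment, lexicon, exceptions, max_len=4):
    """Hypothetical "rules + exceptions": exception table first, default rule otherwise."""
    if fragment in exceptions:
        return exceptions[fragment]
    # Assumed default rule: keep the forward maximum matching result.
    return forward_max_match(fragment, lexicon, max_len)


# Toy data, invented for this example.
LEXICON = {"结合", "合成", "成分", "分子", "从小", "小学"}
EXCEPTIONS = {"从小学": ["从", "小学"]}

if __name__ == "__main__":
    for fragment in overlapping_fragments("他从小学起", LEXICON):
        print(fragment, disambiguate(fragment, LEXICON, EXCEPTIONS))
```

In this sketch a default rule covers most fragments and an explicit table overrides it for the listed cases. Because the table is keyed by fragment, entries from several user dictionaries could be merged into it without changing the detection step, which is one plausible reading of why such a strategy suits a system with dynamic dictionaries.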

Keywords: Chinese information processing; general-purpose word segmentation system; ambiguity segmentation

Classification: TP391.12 [Automation and Computer Technology: Computer Application Technology]
