Segmentation algorithm for Chinese based on extraction of context information

Cited by: 9

Source: Journal of Computer Applications (《计算机应用》), 2005, No. 9, pp. 2025-2027 (3 pages)

Funding: National 863 Program of China (2002AA117010)

Abstract: Word segmentation is a distinctive and important step in Chinese text processing. Traditional dictionary-based segmentation algorithms have a serious weakness: they cannot handle out-of-vocabulary words well. Probabilistic algorithms consider only the probability model estimated from the training corpus, so they perform poorly on texts from other domains. This paper proposes a probabilistic segmentation algorithm based on extracting context information, which incorporates the context information of the text being segmented into the segmentation probability model to guide the segmentation. The algorithm combines the classical n-gram model with the EM algorithm and achieves fairly good results in both closed and open tests.

Keywords: Chinese word segmentation; n-gram model; context information

Classification code: TP391.2 [Automation and Computer Technology: Computer Application Technology]
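
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea of adding context information from the text being segmented to a segmentation probability model. It is not the paper's algorithm. It assumes a unigram model (the n = 1 case of an n-gram model), uses the relative frequencies of substrings that repeat within the text as a simple stand-in for context information, mixes them linearly with a corpus model, and decodes with Viterbi search. The EM re-estimation mentioned in the abstract is not reproduced here. All function names, the `weight` parameter, and the `floor` probability are hypothetical choices made for this example.

```python
import math
from collections import Counter


def context_model(text, max_len=4, min_count=2):
    """Relative frequencies of substrings that repeat within `text` itself."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    repeated = {w: c for w, c in counts.items() if c >= min_count}
    total = sum(repeated.values())
    return {w: c / total for w, c in repeated.items()} if total else {}


def interpolate(base_prob, ctx_prob, weight):
    """Linear mix: (1 - weight) * corpus model + weight * context model."""
    mixed = {}
    for w in set(base_prob) | set(ctx_prob):
        p = (1 - weight) * base_prob.get(w, 0.0) + weight * ctx_prob.get(w, 0.0)
        if p > 0:
            mixed[w] = p
    return mixed


def segment(text, prob, max_len=4, floor=1e-8):
    """Viterbi search for the most probable segmentation under a unigram model."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            p = prob.get(word)
            if p is None:
                if len(word) > 1:
                    continue        # unknown multi-character strings are not candidates
                p = floor           # unknown single characters keep every position reachable
            score = best[j] + math.log(p)
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n
    while i > 0:                    # follow back-pointers from the end of the text
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


def segment_with_context(text, base_prob, weight=0.5, max_len=4):
    """Segment `text` with the corpus model adapted by its own repeated substrings."""
    ctx = context_model(text, max_len)
    return segment(text, interpolate(base_prob, ctx, weight), max_len)
```

For example, with a toy corpus model `{"访问": 0.3, "中国": 0.4, "离开": 0.3}` and the text `"孙悟空访问中国孙悟空离开中国"`, `segment` alone splits the out-of-vocabulary name into the single characters 孙 / 悟 / 空, whereas `segment_with_context` keeps 孙悟空 as one word because it occurs twice in the text. Raw substring counts also reward fragments such as 孙悟, which is one reason a real system would re-estimate the probabilities iteratively rather than use the counts directly.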
