检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
出 处:《计算机应用》2005年第9期2025-2027,共3页journal of Computer Applications
基 金:国家863计划资助项目(2002AA117010)
摘 要:汉语分词在汉语文本处理过程中是一个特殊而重要的组成部分。传统的基于词典的分词算法存在很大的缺陷,无法对未登录词进行很好的处理。基于概率的算法只考虑了训练集语料的概率模型,对于不同领域的文本的处理不尽如人意。文章提出一种基于上下文信息提取的概率分词算法,能够将切分文本的上下文信息加入到分词概率模型中,以指导文本的切分。这种切分算法结合经典n元模型以及EM算法,在封闭和开放测试环境中分别取得了比较好的效果。Chinese segmentation is a special and important issue in Chinese texts processing. The traditional segmentation methods based on an existing dictionary have an obvious defect when they are used to segment texts which may contain words unknown to the dictionary. And the probabilistic methods those consider the probabilistic model of the training set only also do a bad job on the texts of a specific domain. In this paper, a probabilistic segmentation method based on extracting context information was proposed, which adds the context information of the segmenting text into the segmentation probabilistic model so as to guide the processing. The method combining n-gram model and EM algorithm achieves a good effect in the close and opening test.
分 类 号:TP391.2[自动化与计算机技术—计算机应用技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.104