串频统计和词形匹配相结合的汉语自动分词系统  被引量:65

An Chinese Word Automatic Segmentation System Based on String Frequency Statistics Combined with Word Matching

在线阅读下载全文

作  者:刘挺[1] 吴岩[1] 王开铸 

机构地区:[1]哈尔滨工业大学计算机系

出  处:《中文信息学报》1998年第1期17-25,共9页Journal of Chinese Information Processing

摘  要:本文介绍了一种汉语自动分词软件系统,该系统对原文进行三遍扫描:第一遍,利用切分标记将文本切分成汉字短串的序列;第二遍,根据各短串的每个子串在上下文中的频度计算其权值,权值大的子串视为候选词;第三遍,利用候选词集和一部常用词词典对汉字短串进行切分。实验表明,该分词系统的分词精度在1.5%左右,能够识别大部分生词。This paper presents a software system on Chinese automatic word segmentation.The original text is scanned three times:first,the text is cut into short Chinese character string sequence by cut marks;second,every short sting is weighted by its frequency in context,and the short strings weighted heavy are regarded as candidate words;third,short strings are segmented by candidate word set and everyday words.Experiments results shows that the segmentation precision of this word segmentation system is aboue 1.5%,and a large part of new words can be recognized correctly.This system is very suitable to document retrieval and other areas.

关 键 词:中文信息处理 自动分词 汉语 串频统计 词形匹配 

分 类 号:TP391[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象