Research on ambiguous words segmentation algorithm based on improved BP neural network
(Chinese title: 基于改进BP网络的中文歧义字段分词方法研究)  Cited by: 12


Authors: Zhang Li (张利)[1], Zhang Liyong (张立勇)[1], Zhang Xiaomiao (张晓淼)[1], Geng Tiesuo (耿铁锁)[2], Yue Zongge (岳宗阁)[3]

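Affiliations: [1] School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116024, Liaoning, China; [2] State-owned Assets Office, Dalian University of Technology, Dalian 116024, Liaoning, China; [3] Affiliated Hospital of Dalian University of Technology, Dalian 116024, Liaoning, China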

Source: Journal of Dalian University of Technology (《大连理工大学学报》), 2007, No. 1, pp. 131-135 (5 pages)

Funding: National Natural Science Foundation of China (60573172)

Abstract: Automatic word segmentation of ambiguous Chinese fields is a difficult problem in text mining. Chinese is written continuously within each sentence, with no spaces between words, which makes ambiguous fields hard to segment. To address this, the grammatical phenomena underlying typical ambiguities are summarized, and a part-of-speech code library is built for part-of-speech encoding. On this basis, the characters and words in ambiguous fields that follow particular grammatical rules are assigned codes and converted into input vectors that a neural network can accept. The samples are then used for training, so that an improved BP neural network acquires these grammatical rules through self-learning. The training results show that the algorithm reaches a training precision of 93.13% and a test precision of 92.50% on ambiguous-field segmentation.
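The pipeline summarized in the abstract (part-of-speech codes, then an input vector, then a BP network) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the `POS_CODES` table, the three-slot window, the toy samples and their labels, and the use of a momentum term as the "improvement" to plain backpropagation (the record above does not say what the authors' improvement was).

```python
import math
import random

# Hypothetical part-of-speech code library; the paper's actual codes are not given here.
POS_CODES = {"n": 0, "v": 1, "a": 2, "d": 3, "p": 4}
WINDOW = 3  # assumed number of word slots per candidate segmentation

# Toy samples (invented for illustration): POS tags of a candidate segmentation,
# labeled 1.0 if the candidate is treated as acceptable and 0.0 otherwise.
TOY_SAMPLES = [
    (["n", "v", "n"], 1.0), (["d", "v", "n"], 1.0), (["p", "n", "v"], 1.0),
    (["v", "d", "p"], 0.0), (["n", "p", "d"], 0.0), (["a", "d", "p"], 0.0),
]


def encode(pos_tags):
    """One-hot encode up to WINDOW part-of-speech tags into a flat input vector."""
    vec = [0.0] * (WINDOW * len(POS_CODES))
    for slot, tag in enumerate(pos_tags[:WINDOW]):
        vec[slot * len(POS_CODES) + POS_CODES[tag]] = 1.0
    return vec


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BPNet:
    """One-hidden-layer network trained by backpropagation with a momentum term."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        self.dw1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw2 = [0.0] * (n_hidden + 1)

    def forward(self, x):
        xb = x + [1.0]  # append constant bias input
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = hidden + [1.0]
        out = sigmoid(sum(w * v for w, v in zip(self.w2, hb)))
        return xb, hb, out

    def train_step(self, x, target, lr=0.3, momentum=0.5):
        xb, hb, out = self.forward(x)
        delta_out = (target - out) * out * (1.0 - out)
        # Hidden deltas are computed with the output weights from before this update.
        delta_hidden = [delta_out * self.w2[j] * hb[j] * (1.0 - hb[j])
                        for j in range(len(self.w1))]
        for j in range(len(self.w2)):
            self.dw2[j] = lr * delta_out * hb[j] + momentum * self.dw2[j]
            self.w2[j] += self.dw2[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                self.dw1[j][i] = lr * delta_hidden[j] * xb[i] + momentum * self.dw1[j][i]
                row[i] += self.dw1[j][i]

    def predict(self, x):
        return self.forward(x)[2]


def train(samples, epochs=2000, n_hidden=4):
    """Train a BPNet on (pos_tags, label) pairs with per-sample updates."""
    net = BPNet(WINDOW * len(POS_CODES), n_hidden)
    for _ in range(epochs):
        for tags, label in samples:
            net.train_step(encode(tags), label)
    return net


if __name__ == "__main__":
    net = train(TOY_SAMPLES)
    for tags, label in TOY_SAMPLES:
        print(tags, label, round(net.predict(encode(tags)), 3))
```

One-hot slots keep the encoding independent of any ordering among part-of-speech codes. A real system would need a much larger code library and training set than this toy example.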

Keywords: text mining; ambiguous fields; natural language processing; neural network

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
