Chinese word segmentation based on character representation learning

Cited by: 5


Authors: LIU Chunli[1], LI Xiaoge[1], LIU Rui[1], FAN Xian[1], DU Liping[1]

Affiliation: [1] School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an 710121

Source: Journal of Computer Applications (《计算机应用》), 2016, No. 10, pp. 2794-2798 (5 pages)

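Funding: National Natural Science Foundation of China (61373116); Special Fund for Key Disciplines of Regular Higher Education Institutions of Shaanxi Province (112-1602); Graduate Innovation Fund of Xi'an University of Posts and Telecommunications (ZL2013-30)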

Abstract: To improve the accuracy of Chinese word segmentation and the recognition rate of Out-Of-Vocabulary (OOV) words, a Chinese word segmentation system based on character representation learning is proposed. First, the Skip-gram model is used to map the words in the text to vectors in a high-dimensional vector space. Second, the K-means clustering algorithm is used to cluster the word vectors, and the clustering results are used as features for training a Conditional Random Field (CRF) model. Finally, the trained model is used for word segmentation and OOV recognition. The effects of the word vector dimension, the number of clusters, and the choice of clustering algorithm on segmentation are analyzed. The system is tested on the microblog evaluation corpus provided by the 4th CCF Conference on Natural Language Processing and Chinese Computing (NLPCC2015). Experimental results show that, without using external knowledge, the F-value of segmentation and the OOV recognition rate reach 95.67% and 94.78% respectively, which demonstrates that adding character clustering features to the CRF model can effectively improve word segmentation performance on Chinese short texts.
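The pipeline described in the abstract (embedding, clustering, cluster IDs as CRF features) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the 2-D `toy_vectors` stand in for real Skip-gram embeddings, `kmeans` is a simplified K-means with a deterministic initialisation, the feature names in `cluster_features` are hypothetical, and the B/M/E/S tag set in `bmes_tags` is a common convention that the abstract does not confirm.

```python
def kmeans(vectors, k, iters=20):
    """Simplified K-means over {char: vector}; returns {char: cluster_id}.

    Assumption: the first k characters (in sorted order) seed the centers,
    which keeps this sketch deterministic.
    """
    chars = sorted(vectors)
    centers = [list(vectors[c]) for c in chars[:k]]
    assign = {}
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for c in chars:
            dists = [sum((a - b) ** 2 for a, b in zip(vectors[c], ctr))
                     for ctr in centers]
            assign[c] = dists.index(min(dists))
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [vectors[c] for c in chars if assign[c] == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_features(sentence, cluster_of, window=1):
    """Per-character feature dicts that include neighbouring cluster IDs."""
    feats = []
    for i, ch in enumerate(sentence):
        f = {"char": ch}
        for off in range(-window, window + 1):
            j = i + off
            if 0 <= j < len(sentence):
                # -1 marks a character that has no cluster assignment.
                f[f"cluster[{off}]"] = cluster_of.get(sentence[j], -1)
            else:
                f[f"cluster[{off}]"] = "PAD"  # position outside the sentence
        feats.append(f)
    return feats


def bmes_tags(words):
    """Convert a segmented sentence into B/M/E/S character labels."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


# Hypothetical 2-D vectors standing in for Skip-gram output.
toy_vectors = {"我": (0.0, 0.0), "你": (0.1, 0.0),
               "吃": (5.0, 5.0), "喝": (5.1, 5.0)}
assign = kmeans(toy_vectors, k=2)
feats = cluster_features("我吃", assign)
tags = bmes_tags(["我", "喜欢", "自然语言"])
```

In a real system, the feature dicts and tag sequences would be passed to a CRF toolkit for training. Using discrete cluster IDs rather than raw vectors keeps the features compatible with CRF feature templates, which expect symbolic values.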

Keywords: representation learning; word vector; clustering; Conditional Random Field (CRF); Chinese word segmentation

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
