Review of Chinese Word Segmentation Algorithms (cited by: 12)

Authors: WANG Wenfei; XU Haojie; YANG Wenzhen[1]; WU Xinli[1] (School of Mechanical Engineering and Automation, Zhejiang Sci-Tech University, Hangzhou 310018, China)

Affiliation: [1] School of Mechanical Engineering and Automation, Zhejiang Sci-Tech University, Hangzhou 310018, Zhejiang, China

Source: Group Technology & Production Modernization (《成组技术与生产现代化》), 2018, No. 3, pp. 1-8 (8 pages)

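Funding: Key Program of the National Natural Science Foundation of China (61332017); National Key R&D Program of China (2017YFB1002803; 2018YFB1004901); Key Program of the Zhejiang Provincial Natural Science Foundation (LZ14E050003); Guangzhou Innovation and Entrepreneurship Leading Team Program (CXLJTD-201609)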

Abstract: Improving segmentation accuracy is the central concern of Chinese word segmentation algorithms, and two bottlenecks limit their effectiveness: ambiguity elimination and the recognition of unregistered (out-of-vocabulary) words. Focusing on these two bottlenecks, this paper summarizes recent research on dictionary-based, statistics-based, and semantic-understanding-based Chinese word segmentation algorithms, weighs their advantages and disadvantages, and outlines the development trend of the field. Dictionary-based algorithms aim at better time and space efficiency and raise segmentation efficiency by improving the dictionary structure. The double-character hash structure is currently among the better-performing dictionary mechanisms for word lookup, but it contributes little to ambiguity elimination or unregistered word recognition. Statistics-based algorithms improve the statistical language probability model; they can eliminate segmentation ambiguity to a certain extent and recognize unregistered words fairly well. The conditional random field (CRF) model combines the features of the hidden Markov model (HMM) and the maximum entropy (ME) model and is currently the mainstream training model for statistics-based segmentation. With the research and application of neural networks, algorithms based on semantic understanding perform well on both ambiguity elimination and unregistered word recognition and can improve segmentation accuracy. Future Chinese word segmentation algorithms will focus more on contextual semantics and use deep learning techniques to further improve ambiguity elimination and unregistered word recognition, and thereby segmentation accuracy.

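Keywords: Chinese word segmentation; ambiguity elimination; unregistered word recognition; dictionary mechanism; semantic understanding; deep learning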

Classification code: TP312 [Automation and Computer Technology: Computer Software and Theory]
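
Illustrative sketch (not from the paper): the abstract states that a double-character hash dictionary gives fast word lookup but does little for ambiguity elimination. The Python below is one possible reading of that idea, under stated assumptions. The index layout (first character, then second character, then a set of remaining tails), the function names, and the toy word list are all hypothetical, and forward maximum matching is used here only as a simple, well-known dictionary-based strategy; the record does not say which matching strategy the authors discuss.

```python
from collections import defaultdict


def build_double_char_index(words):
    """Index words by first character, then by second character (assumed layout)."""
    index = defaultdict(lambda: defaultdict(set))
    for word in words:
        if len(word) >= 2:
            # Keep only the part after the first two characters
            # (the empty string stands for a two-character word).
            index[word[0]][word[1]].add(word[2:])
    return index


def longest_match(text, start, index, max_len=6):
    """Length of the longest dictionary word starting at `start` (1 if none)."""
    if start + 1 >= len(text):
        return 1
    tails = index.get(text[start], {}).get(text[start + 1])
    if tails is None:
        return 1
    limit = min(max_len, len(text) - start)
    # Try the longest candidate first, down to two characters.
    for length in range(limit, 1, -1):
        if text[start + 2:start + length] in tails:
            return length
    return 1


def segment(text, index, max_len=6):
    """Forward maximum matching: greedily take the longest word at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        length = longest_match(text, pos, index, max_len)
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


# Toy dictionary, for illustration only.
toy_index = build_double_char_index(["研究", "研究生", "生命", "起源"])
print(segment("研究生命起源", toy_index))
```

With this toy dictionary the greedy matcher takes 研究生 first and leaves 命 stranded, although 研究 / 生命 / 起源 is the intended reading. Two hash lookups find the candidate words quickly, but nothing in the lookup itself decides between overlapping candidates, which is consistent with the abstract's remark that the dictionary mechanism contributes little to ambiguity elimination.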
