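Authors: 汪文妃 (WANG Wenfei); 徐豪杰 (XU Haojie); 杨文珍[1] (YANG Wenzhen); 吴新丽[1] (WU Xinli)
Affiliation: [1] School of Mechanical Engineering and Automation, Zhejiang Sci-Tech University, Hangzhou 310018, Zhejiang, China
Source: 《成组技术与生产现代化》 (Group Technology & Production Modernization), 2018, No. 3, pp. 1-8 (8 pages)
Funding: Key Program of the National Natural Science Foundation of China (61332017); National Key R&D Program of China (2017YFB1002803; 2018YFB1004901); Key Program of the Zhejiang Provincial Natural Science Foundation (LZ14E050003); Guangzhou Innovation and Entrepreneurship Leading Team Program (CXLJTD-201609)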
Abstract: Improving segmentation accuracy is the core concern of Chinese word segmentation algorithms, and ambiguity resolution and unregistered (out-of-vocabulary) word recognition are the two major bottlenecks that limit their effectiveness. Focusing on these two bottlenecks, this paper reviews and summarizes recent research on dictionary-based, statistics-based, and semantic-understanding-based Chinese word segmentation algorithms, compares their advantages and disadvantages, and outlines the development trend of the field. Dictionary-based algorithms target time and space efficiency and raise segmentation efficiency by improving the dictionary structure. The double-character (double-word) hash structure is currently one of the better-performing dictionary mechanisms for word lookup, but it contributes little to ambiguity resolution or unregistered word recognition. Statistics-based algorithms improve the statistical language probability model; they can resolve segmentation ambiguity to a certain extent and recognize unregistered words fairly well. The conditional random field (CRF) model combines characteristics of the hidden Markov model (HMM) and the maximum entropy (ME) model, and is currently the mainstream training model for statistics-based segmentation. With the research and application of neural networks, algorithms based on semantic understanding perform well on both ambiguity resolution and unregistered word recognition, which improves segmentation accuracy. Future Chinese word segmentation algorithms are expected to focus more on contextual semantics and to use deep learning to further improve ambiguity resolution and unregistered word recognition, thereby improving segmentation accuracy.
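As an illustration of the dictionary-based approach summarized in the abstract, the following is a minimal Python sketch of a two-level (first character, then second character) hash index combined with forward maximum matching. The abstract does not specify the paper's data structure or matching strategy, so the index layout, the greedy longest-match rule, the names `build_double_hash` and `segment`, the `max_len` parameter, and the toy vocabulary are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_double_hash(words):
    """Index words of length >= 2 by first char, then second char.

    Each leaf is the set of remaining suffixes (hypothetical layout; "" means
    the two-character word itself is in the dictionary).
    """
    index = defaultdict(lambda: defaultdict(set))
    for word in words:
        if len(word) >= 2:
            index[word[0]][word[1]].add(word[2:])
    return index


def segment(text, index, max_len=4):
    """Greedy forward maximum matching against the two-level index."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: emit a single character
        if i + 1 < len(text):
            # .get() avoids creating empty entries in the defaultdict
            suffixes = index.get(text[i], {}).get(text[i + 1])
            if suffixes:
                # Try the longest candidate first, down to length 2
                for n in range(min(max_len, len(text) - i), 1, -1):
                    if text[i + 2:i + n] in suffixes:
                        match = text[i:i + n]
                        break
        tokens.append(match)
        i += len(match)
    return tokens


# Toy vocabulary (assumed, not from the paper)
vocab = ["中文", "分词", "算法", "中文分词"]
index = build_double_hash(vocab)
print(segment("中文分词算法", index))
print(segment("中文的算法", index))
```

In this sketch a character that starts no dictionary word is emitted on its own, and the greedy rule always commits to the longest match, which is consistent with the abstract's remark that a faster dictionary structure by itself does little for ambiguity resolution or unregistered word recognition.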
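Keywords: Chinese word segmentation; ambiguity resolution; unregistered word recognition; dictionary mechanism; semantic understanding; deep learning
Classification code: TP312 [Automation and Computer Technology / Computer Software and Theory]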