Application of Unsupervised Word Segmentation Algorithm in New Word Recognition (无监督分词算法在新词识别中的应用)  Cited by: 2

Authors: JIANG Tao (姜涛), LU Yang (陆阳)[1,2], ZHANG Jie (张洁)[3], HONG Jian (洪建)[3]

Affiliations: [1] School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China; [2] Engineering Research Center of Safety Critical Industry Measure and Control Technology, Ministry of Education, Hefei 230601, China; [3] Information Center, The First Affiliated Hospital of Anhui Medical University, Hefei 230022, China

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2020, No. 4, pp. 888-892 (5 pages)

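Funding: Supported by the Key Project of the Anhui Provincial Department of Education (SK2018A0154) and a special project of the National Key Research and Development Program of China (2016YFC0801804).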

Abstract: In new word recognition, pre-segmenting text with an off-the-shelf word segmentation tool yields poor segmentation accuracy in some domains, because such tools are limited by their training corpora. To address this problem, this paper proposes an improved method. The method first performs unsupervised pre-segmentation based on an N-gram language model, then discovers new words using word frequency, mutual information, and branch entropy as the main features. It also applies named entity recognition to filter the discovered results and, once the candidate words are obtained, uses grid search to find the optimal hyperparameter combination. Experiments on corpora from four different domains, under a single shared set of hyperparameters, give accuracies of 88.3%, 80.5%, 85.9%, and 91.9% for the top 10% of new words. The results show that this unsupervised segmentation method is suitable for new word recognition and adapts well across domains.
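The abstract names word frequency, mutual information, and branch entropy as the main features but does not give their formulas. The following is a minimal sketch of one common way to compute them, not the authors' implementation. It assumes raw character n-grams as candidates (in place of the N-gram language model pre-segmentation), takes the minimum pointwise mutual information over all binary splits, and takes the smaller of the left and right neighbour entropies. The function name `candidate_features` and the parameter `max_len` are hypothetical, and the named entity filter and grid search are omitted.

```python
import math
from collections import Counter, defaultdict


def candidate_features(text, max_len=4):
    """Return {candidate: (frequency, min_pmi, branch_entropy)}.

    Illustrative only: candidates are all character n-grams of length
    2..max_len, which is an assumption and not the paper's pre-segmentation.
    """
    n = len(text)
    # Frequency of every character n-gram up to max_len.
    counts = Counter(
        text[i:i + k] for k in range(1, max_len + 1) for i in range(n - k + 1)
    )
    left = defaultdict(Counter)   # candidate -> counts of left neighbours
    right = defaultdict(Counter)  # candidate -> counts of right neighbours
    for k in range(2, max_len + 1):
        for i in range(n - k + 1):
            w = text[i:i + k]
            if i > 0:
                left[w][text[i - 1]] += 1
            if i + k < n:
                right[w][text[i + k]] += 1

    def prob(s):
        # Relative frequency among n-grams of the same length.
        return counts[s] / (n - len(s) + 1)

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    features = {}
    for w, freq in counts.items():
        if len(w) < 2:
            continue
        # Cohesion: weakest pointwise mutual information over binary splits.
        pmi = min(
            math.log(prob(w) / (prob(w[:j]) * prob(w[j:])))
            for j in range(1, len(w))
        )
        # Boundary freedom: the smaller of left and right neighbour entropy.
        branch = min(entropy(left[w]), entropy(right[w]))
        features[w] = (freq, pmi, branch)
    return features
```

A candidate would then be kept as a new word only if all three values exceed thresholds. Those thresholds are the kind of hyperparameters that the grid search mentioned in the abstract would tune.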

Keywords: new word recognition; mutual information; branch entropy; N-gram language model; Chinese word segmentation

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
