Authors: ZHU Yuying (祝钰莹), GUO Yan (郭燕), WAN Yizhao (万亿兆) [2], TIAN Kai (田凯)
Affiliations: [1] Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, China; [2] School of Software Engineering, University of Science and Technology of China, Suzhou, Jiangsu 215123, China
Source: Computer Science (《计算机科学》), 2023, No. 7, pp. 221-228 (8 pages)
Abstract: New word detection is a basic task in Chinese natural language processing and is crucial for improving the performance of various downstream tasks. This paper proposes a new word detection method based on branch entropy and segmentation probability. The method first generates a candidate word set from the text to be processed using branch entropy, then computes the segmentation probability of each candidate in order to filter out noisy candidates and keep the genuine new words. Two models are proposed, depending on whether an annotated corpus related to the text to be processed is available. When no related segmented corpus is available, a multi-criteria Transformer-CRF model is trained on general word segmentation benchmark datasets. When a domain-specific segmented corpus is available, a key-value memory neural network is introduced to fully incorporate wordhood information. Experimental results show that the multi-criteria Transformer-CRF model achieves a MAP of 54.00% for law-related words among the top 900 extracted words, which is 2.15% higher than the candidate word set produced by the unsupervised method. When new words are extracted with a segmented legal corpus, the key-value memory neural network further outperforms the multi-criteria Transformer-CRF model, with an improvement of 3.43%.
Keywords: new word detection; information entropy; mutual information; Transformer; conditional random field; key-value memory neural network
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
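The abstract describes a first stage that generates candidate words from raw text using branch entropy. The sketch below illustrates that general idea only; it is not the authors' implementation. The function names (`branch_entropy`, `candidate_words`) and the thresholds (`max_len`, `min_freq`, `min_entropy`) are assumptions chosen for illustration, and the second stage (segmentation probability from a Transformer-CRF or key-value memory network) is not shown.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(neighbor_counts):
    """Shannon entropy (natural log) of the distribution of neighboring characters."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())


def candidate_words(text, max_len=4, min_freq=2, min_entropy=0.5):
    """Return {ngram: score} for character n-grams whose left and right
    branch entropy are both at least min_entropy.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    freq = Counter()
    left = defaultdict(Counter)   # characters seen immediately before each n-gram
    right = defaultdict(Counter)  # characters seen immediately after each n-gram
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            if i + k > n:
                break
            gram = text[i:i + k]
            freq[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + k < n:
                right[gram][text[i + k]] += 1

    result = {}
    for gram, count in freq.items():
        if count < min_freq:
            continue
        # A string that appears in varied contexts on both sides is more
        # likely to be a free-standing word, so score by the weaker side.
        score = min(branch_entropy(left[gram]), branch_entropy(right[gram]))
        if score >= min_entropy:
            result[gram] = score
    return result
```

Scoring by the minimum of the left and right entropies is one common design choice: a string with a fixed neighbor on either side is probably a fragment of a longer word. In a pipeline like the one the abstract outlines, the surviving candidates would then be passed to a supervised model for filtering.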