Discovery of Chinese New Words in Open Domain Based on BERT (基于BERT的开放领域中文新词发现研究)  Cited by: 2


Authors: Liu Fanping; Chen Hui; Shen Zhenlei; Wu Yejian (Shanghai 2345 Network Technology Co., Ltd., Shanghai 201203, China)

Affiliation: [1] Shanghai 2345 Network Technology Co., Ltd., Shanghai 201203, China

Source: Computer Applications and Software (《计算机应用与软件》), 2023, No. 6, pp. 173-180 (8 pages)

Abstract: Current new word discovery methods suffer from low accuracy, poor portability, and a dependence on large-scale corpora. To address these problems, this paper proposes a BERT-based method for recognizing new words in the open domain. Exploiting BERT's strong grasp of sentence meaning, a word and its context are fed into the model to train a word recognizer. The test text is treated as a byte stream and scanned with a sliding window of size N to produce candidate words. Each candidate is then classified to decide whether it forms a word in its context; if a candidate judged to be a word does not appear in the standard lexicon, it is reported as a new word. The method is compared with a new word discovery method based on mutual information and left-right entropy and with one based on conditional random fields. The results show that the proposed method achieves higher precision and F1, and also higher recall when recognizing named entities.
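The candidate generation and filtering steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the window is applied to characters rather than raw bytes, all window lengths from 2 up to N are tried, and the trained BERT word recognizer is replaced by a hypothetical callable `is_word(text, start, end)`. The names `sliding_window_candidates`, `discover_new_words`, and `toy_is_word` are invented for this example.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical signature standing in for the trained BERT word recognizer:
# given the full text and a [start, end) span, decide whether the span is a word
# in that context.
WordClassifier = Callable[[str, int, int], bool]


def sliding_window_candidates(text: str, max_n: int, min_n: int = 2) -> List[Tuple[int, str]]:
    """Return (start_offset, candidate) for every window of length min_n..max_n."""
    candidates = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            candidates.append((start, text[start:start + n]))
    return candidates


def discover_new_words(text: str, lexicon: Set[str], is_word: WordClassifier,
                       max_n: int = 4) -> List[str]:
    """Collect candidates that the classifier accepts and the lexicon lacks."""
    found: List[str] = []
    seen: Set[str] = set()
    for start, cand in sliding_window_candidates(text, max_n):
        # Known words and already-reported candidates need no classification.
        if cand in lexicon or cand in seen:
            continue
        if is_word(text, start, start + len(cand)):
            seen.add(cand)
            found.append(cand)
    return found


# Toy stand-in for the classifier (assumption: a real system would call a
# fine-tuned BERT model here instead of a fixed set lookup).
def toy_is_word(text: str, start: int, end: int) -> bool:
    return text[start:end] in {"喜欢", "奥利给", "这个"}


if __name__ == "__main__":
    print(discover_new_words("我喜欢奥利给这个词", {"喜欢", "这个"}, toy_is_word, max_n=3))
```

Checking the lexicon before calling the classifier gives the same result as classifying first and filtering afterwards, but avoids model calls for words that are already known.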

Keywords: BERT; new word discovery; classifier

Classification code: TP3 [Automation and Computer Technology: Computer Science and Technology]

 
