Word Segmentation for Ancient Chinese Texts Based on Nonparametric Bayesian Models and Deep Learning

Cited by: 19


Authors: YU Jingsong [1]; WEI Yi; ZHANG Yongwei [2]; YANG Hao (School of Software and Microelectronics, Peking University, Beijing 100871, China; Institute of Linguistics, Chinese Academy of Social Sciences, Beijing 100732, China; Editorial and Research Center of Confucian Canon, Peking University, Beijing 100871, China)

Affiliations: [1] School of Software and Microelectronics, Peking University, Beijing 100871, China; [2] Institute of Linguistics, Chinese Academy of Social Sciences, Beijing 100732, China; [3] Editorial and Research Center of Confucian Canon, Peking University, Beijing 100871, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2020, No. 6, pp. 1-8

Funding: National Natural Science Foundation of China (61876004)

Abstract: In ancient Chinese texts, characters are written continuously, with no explicit segmentation marks between words; this poses great challenges to modern readers' understanding of these texts and even to cultural inheritance. Mainstream automatic word segmentation methods require large amounts of manually segmented training data, which are costly to produce and especially scarce for ancient Chinese, limiting their applicability. To address word segmentation for ancient Chinese texts, we propose Multi-Stage Iterative Training (MSIT), an unsupervised segmentation method that combines nonparametric Bayesian models with BERT (Bidirectional Encoder Representations from Transformers). It achieves an F1 score of 93.28% on the Zuozhuan (an ancient Chinese history book) dataset. After adding only 500 ground-truth sentences, which can be regarded as weakly supervised learning, the F1 score reaches 95.55%, outperforming the previous best result, which was trained on 6/7 of the Zuozhuan dataset (about 36,000 ground-truth sentences). With the same training set, our method achieves an F1 score of 97.40%, the state-of-the-art result. The proposed method not only outperforms traditional sequence labeling approaches, including BERT, but our experiments also show that it has better generalization ability. The model and related code are available online.
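The abstract does not detail the nonparametric Bayesian component, but models in this family are typically Dirichlet-process unigram word models, sampled with Gibbs moves over boundary variables (in the style of Goldwater et al.'s unsupervised segmentation work). The sketch below illustrates that general technique only, not the paper's exact model: the class name `DPSegmenter`, the geometric-length base distribution, and the hyperparameter values are all invented for the example.

```python
import random
from collections import Counter

class DPSegmenter:
    """Illustrative Dirichlet-process unigram segmenter with boundary-wise
    Gibbs sampling. A sketch of the general technique, not the paper's model."""

    def __init__(self, sentences, alpha=2.0, seed=0):
        random.seed(seed)
        self.alpha = alpha
        chars = {c for s in sentences for c in s}
        self.p_char = 1.0 / len(chars)   # uniform base measure over characters
        self.sents = sentences
        # bounds[i][j] is True if there is a word boundary after character j
        self.bounds = [[random.random() < 0.5 for _ in range(len(s) - 1)]
                       for s in sentences]
        self.counts, self.total = Counter(), 0
        for i in range(len(sentences)):
            for w in self.segment(i):
                self._add(w)

    def _add(self, w):
        self.counts[w] += 1
        self.total += 1

    def _remove(self, w):
        self.counts[w] -= 1
        self.total -= 1
        if self.counts[w] == 0:
            del self.counts[w]

    def _base(self, w):
        # geometric word-length prior times uniform character probabilities
        return (0.5 * self.p_char) ** len(w)

    def _prob(self, w):
        # Chinese-restaurant-process predictive probability of word w
        return (self.counts.get(w, 0) + self.alpha * self._base(w)) / \
               (self.total + self.alpha)

    def segment(self, i):
        """Read the current word sequence of sentence i off its boundaries."""
        s, out, start = self.sents[i], [], 0
        for j, cut in enumerate(self.bounds[i]):
            if cut:
                out.append(s[start:j + 1])
                start = j + 1
        out.append(s[start:])
        return out

    def gibbs_pass(self):
        """Resample every boundary variable once."""
        for i, s in enumerate(self.sents):
            b = self.bounds[i]
            for j in range(len(b)):
                # locate the words adjacent to the candidate boundary after j
                left = j
                while left > 0 and not b[left - 1]:
                    left -= 1
                right = j + 1
                while right < len(b) and not b[right]:
                    right += 1
                w1, w2 = s[left:j + 1], s[j + 1:right + 1]
                merged = s[left:right + 1]
                # remove the words touching this boundary from the counts
                if b[j]:
                    self._remove(w1)
                    self._remove(w2)
                else:
                    self._remove(merged)
                # sample: split into (w1, w2) vs. keep the merged word
                p_split = self._prob(w1) * self._prob(w2)
                p_merge = self._prob(merged)
                b[j] = random.random() < p_split / (p_split + p_merge)
                if b[j]:
                    self._add(w1)
                    self._add(w2)
                else:
                    self._add(merged)
```

A few Gibbs passes over a toy corpus of concatenated words tend to carve out frequently recurring substrings as lexical units; the paper's contribution, by contrast, lies in iterating between such a Bayesian model and a BERT-based learner.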

Keywords: ancient Chinese word segmentation; nonparametric Bayesian model; deep learning; unsupervised learning; weakly supervised learning

CLC Classification: TP391 (Automation and Computer Technology / Computer Application Technology); TP18 (Automation and Computer Technology / Computer Science and Technology)

 
