Cantonese Word Segmentation Combining Unsupervised and Supervised Methods


Authors: SU Zhen-jiang (苏振江), ZHANG Yang-sen (张仰森) [1,2], HU Chang-xiu (胡昌秀), HUANG Gai-juan (黄改娟)

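Affiliations: [1] Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100192, China; [2] Beijing Laboratory of National Economic Security Early-Warning Engineering, Beijing Jiaotong University, Beijing 100044, China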

Source: Computer Engineering and Design (《计算机工程与设计》), 2023, No. 8, pp. 2482-2488 (7 pages)

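Funding: National Natural Science Foundation of China (61772081); Science and Technology Innovation Service Capacity Building, Research Base Construction, Beijing Laboratory: Fund Project of the Beijing Laboratory of National Economic Security Early-Warning Engineering (PXM2018_014224_000010).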

Abstract: To enable Cantonese research when segmented Cantonese corpora are scarce, a Cantonese word segmentation method combining unsupervised and supervised approaches is proposed. A Cantonese lexicon is first built from multi-source corpora. A bigram dictionary and the Cantonese lexicon are then used to perform an initial screening segmentation and a second-pass segmentation of the preliminary results. A directed acyclic graph (DAG) is used to analyze and correct segmentation errors in common Cantonese sentence patterns. Finally, the corrected segmented corpus is used to train a deep learning model that consolidates the segmentation behavior, yielding a segmentation model with a three-layer BERT-BiLSTM-CRF architecture. Experimental results show that the method effectively overcomes the lack of pre-segmented corpora, reaching an F-score of 74.3% without requiring a large segmented corpus.
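The abstract names a DAG step but gives no implementation details. As a rough illustration of how DAG-based dictionary segmentation is commonly done (not the authors' procedure), the following minimal sketch builds a graph of lexicon matches over a sentence and picks the highest-probability path by dynamic programming. The lexicon `TOY_LEXICON`, its counts, and the function names `build_dag` and `segment` are all hypothetical and chosen only for this example.

```python
import math

# Hypothetical toy lexicon with made-up counts; not taken from the paper.
TOY_LEXICON = {
    "我": 50, "哋": 5, "我哋": 40, "今日": 30, "今": 5, "日": 10,
    "去": 40, "飲茶": 25, "飲": 10, "茶": 10,
}


def build_dag(sentence, lexicon):
    """Map each start index to the exclusive end indices of lexicon words starting there."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in lexicon]
        # Fall back to a single character so every position has an outgoing edge.
        dag[i] = ends or [i + 1]
    return dag


def segment(sentence, lexicon):
    """Return the maximum log-probability path through the DAG as a list of words."""
    total = sum(lexicon.values())
    n = len(sentence)
    dag = build_dag(sentence, lexicon)
    # best[i] = (score of the best segmentation of sentence[i:], end index of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            # Unknown single characters are smoothed with a count of 1 (an assumption).
            (math.log(lexicon.get(sentence[i:j], 1) / total) + best[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print(segment("我哋今日去飲茶", TOY_LEXICON))  # ['我哋', '今日', '去', '飲茶']
```

With this toy lexicon, multi-character entries such as 我哋 outscore the product of their single-character parts, so the path through the longer words is selected. A real system would replace the made-up counts with frequencies estimated from corpora.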

Keywords: Cantonese; word segmentation research; lexicon; mutual information; end-to-end model; supervised model; unsupervised model

Classification code: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]

 
