Semi-supervised Ensemble Construction Algorithm of Vietnamese Word-segmentation Dictionary


Authors: LIU Wuying; WANG Lin

Affiliations: [1] Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou 510420, Guangdong, China; [2] Xianda College of Economics and Humanities, Shanghai International Studies University, Shanghai 200083, China

Source: Journal of Zhengzhou University: Natural Science Edition, 2018, No. 1, pp. 60-65 (6 pages)

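Funding: Key Project of the State Language Commission of China (ZDI135-26); Featured Innovation Project of Guangdong Province Universities (2015KTSCX035); Bidding Project of the Guangdong Provincial Key Laboratory of Philosophy and Social Sciences (LEC2017WTKT002)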

Abstract: To address the problem of constructing a Vietnamese word-segmentation dictionary, a novel semi-supervised ensemble construction method is proposed. Combined with manual intervention, the method identifies multisyllabic words in a large-scale unlabeled Vietnamese corpus. First, a syllable-level n-gram word generator is designed to produce as many candidate multisyllabic words as possible. Second, three statistical features are calculated, and a corresponding word extractor is implemented for each feature according to a preset threshold. Next, Vietnamese experts check and correct the three individual dictionaries. Finally, a dictionary combiner merges the extracted dictionaries into a single ensemble dictionary. The effectiveness of these dictionaries was evaluated through a direct experiment and an indirect experiment. The experimental results show that the proposed semi-supervised ensemble construction method is effective, and that the two Vietnamese word-segmentation algorithms using these dynamically extracted dictionaries can both achieve comparable performance.
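The pipeline described in the abstract (candidate generation, three threshold-based extractors, expert review, merging) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not name the three statistical features, so raw frequency, a generalized pointwise mutual information, and a generalized Dice coefficient are used here as stand-ins. The function names, default thresholds, the `review` hook that stands in for the manual expert check, and the use of set union as the combiner are all hypothetical.

```python
from collections import Counter
from math import log


def generate_candidates(sentences, max_n=3):
    """Count unigrams and all syllable-level n-grams (2 <= n <= max_n)."""
    unigrams, ngrams = Counter(), Counter()
    for sentence in sentences:
        syllables = sentence.split()  # written Vietnamese separates syllables with spaces
        unigrams.update(syllables)
        for n in range(2, max_n + 1):
            for i in range(len(syllables) - n + 1):
                ngrams[tuple(syllables[i:i + n])] += 1
    return unigrams, ngrams


def build_dictionaries(sentences, max_n=3, min_freq=2, min_pmi=1.0,
                       min_dice=0.5, review=lambda name, words: words):
    """Return one dictionary per (assumed) feature plus their union."""
    unigrams, ngrams = generate_candidates(sentences, max_n)
    total = sum(unigrams.values())

    # Assumed feature 1: raw n-gram frequency.
    def frequency(gram, count):
        return count

    # Assumed feature 2: log of joint probability over the product of
    # syllable probabilities (both estimated against the syllable total).
    def pmi(gram, count):
        independent = 1.0
        for syllable in gram:
            independent *= unigrams[syllable] / total
        return log((count / total) / independent)

    # Assumed feature 3: Dice coefficient generalized to n syllables.
    def dice(gram, count):
        return len(gram) * count / sum(unigrams[s] for s in gram)

    features = {
        "frequency": (frequency, min_freq),
        "pmi": (pmi, min_pmi),
        "dice": (dice, min_dice),
    }
    result = {}
    for name, (score, threshold) in features.items():
        words = {" ".join(g) for g, c in ngrams.items() if score(g, c) >= threshold}
        # Hypothetical hook for the manual check-and-correct step.
        result[name] = set(review(name, words))
    # Assumed combiner: plain union of the reviewed dictionaries.
    result["ensemble"] = set().union(*(result[n] for n in features))
    return result
```

Calling `build_dictionaries(corpus_sentences)` returns the three individual dictionaries under the keys `"frequency"`, `"pmi"`, and `"dice"`, plus the merged one under `"ensemble"`. Passing a custom `review` callable is one way to model the expert correction step, for example by filtering each extracted set against a list of rejected entries.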

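Keywords: semi-supervised ensemble construction; word-segmentation dictionary; multisyllabic word; syllable-level n-gram word; Vietnamese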

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
