An Empirical Comparison and Analysis of Low-Resource POS Tagging Approaches Based on Unsupervised Models


Authors: LI Yang, ZHOU Houquan, LI Zhenghua, ZHANG Min

Affiliation: School of Computer Science & Technology, Soochow University, Suzhou 215006, Jiangsu, China

Source: Journal of Xiamen University (Natural Science), 2024, No. 2, pp. 221-231 (11 pages)

Funding: National Natural Science Foundation of China (62176173, 61876116).

Abstract: [Objective] Part-of-speech (POS) tagging aims to grammatically categorize each word in a sentence with a corresponding POS tag. While the performance of POS tagging models in rich-resource scenarios has advanced considerably, substantial room for improvement remains in low-resource scenarios, including the few-sample scenario and the dictionary-labeling scenario. Previous research focused primarily on enhancing models from the perspective of the training data, paying limited attention to the model itself. In this paper, we tackle the issue from the model perspective and leverage unsupervised models so that unlabeled data can be exploited. [Methods] Building on prior work, we set up a few-sample scenario and a dictionary-labeling scenario on the English Penn Treebank. We then selected several representative unsupervised POS tagging models, namely the Gaussian hidden Markov model (GHMM), mutual information maximization (MIM), and the conditional random field autoencoder (CRF-AE), and adapted them to the two low-resource scenarios by modifying their training objectives. As the baseline, we chose the conditional random field (CRF), a traditional supervised POS tagging model. [Results] In the few-sample scenario, MIM achieves the highest accuracy under the smallest sample size, and CRF-AE consistently outperforms CRF when pre-trained language models are not employed. As the sample size increases, the advantage of CRF-AE over CRF diminishes, and the performance of GHMM and MIM also gradually declines relative to CRF. After applying pre-trained language models to both CRF-AE and CRF, both models improve significantly, yet CRF-AE still outperforms CRF under limited and minimal sample sizes. In the dictionary-labeling scenario, in contrast, CRF consistently achieves the best results across all settings, and the unsupervised models always fall behind it. [Conclusion] The unsupervised loss is better at modeling high-frequency words, which yields stronger performance in the few-sample scenario; at the same time, it tends to produce more uniform POS distributions, which degrades performance in the dictionary-labeling scenario.
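To make the comparison concrete, the following is a minimal sketch of the objectives (to be maximized) that are generally associated with the three unsupervised models named above. The notation, and in particular the weighted combination with the supervised CRF loss in the last line, are illustrative assumptions: the abstract does not give the exact formulation used in the paper.

\begin{align*}
\mathcal{L}_{\text{GHMM}} &= \sum_{i} \log \sum_{\mathbf{t}} \prod_{j} p(t_j \mid t_{j-1})\,
  \mathcal{N}\!\left(\mathbf{x}^{(i)}_j;\ \boldsymbol{\mu}_{t_j}, \boldsymbol{\Sigma}_{t_j}\right)
  && \text{(marginal likelihood of word embeddings } \mathbf{x} \text{)} \\
\mathcal{L}_{\text{MIM}} &= I(X;\,T) = H(T) - H(T \mid X)
  && \text{(mutual information between words and predicted tags)} \\
\mathcal{L}_{\text{CRF-AE}} &= \sum_{i} \log \sum_{\mathbf{t}} p_{\phi}(\mathbf{t} \mid \mathbf{x}^{(i)})\;
  p_{\theta}(\mathbf{x}^{(i)} \mid \mathbf{t})
  && \text{(CRF encoder, word-reconstructing decoder)} \\
\mathcal{L}_{\text{low-res}} &= \mathcal{L}^{\text{sup}}_{\text{CRF}} + \lambda\,\mathcal{L}_{\text{unsup}}
  && \text{(assumed joint form; } \lambda \text{ is a hypothetical weight)}
\end{align*}

Here $\mathcal{L}^{\text{sup}}_{\text{CRF}}$ is the standard CRF log-likelihood on the annotated data, and $\mathcal{L}_{\text{unsup}}$ is one of the three objectives above computed on unlabeled sentences. In the dictionary-labeling scenario, the sum over tag sequences $\mathbf{t}$ would presumably be restricted to sequences consistent with the dictionary, which is the usual way a tag dictionary constrains such models.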

Keywords: POS tagging; low-resource learning; dictionary labeling; unsupervised learning

CLC number: TP391.1 (Automation and Computer Technology: Computer Application Technology)
