Fine Tuning Language Models: A Tale of Two Low-Resource Languages


Authors: Rosel Oida-Onesa, Melvin A. Ballera

Affiliations: [1] Technological Institute of the Philippines Manila, Casal, Manila 1000, Philippines; [2] Camarines Sur Polytechnic Colleges, Nabua, Camarines Sur 4434, Philippines

Source: Data Intelligence, 2024, No. 4, pp. 946-967 (22 pages)

Abstract: Creating a parallel corpus for machine translation is a challenging and time-consuming task, especially in a linguistically diverse country like the Philippines, with 185 languages. Although a wealth of text is available, annotated data is scarce, particularly for languages like Bikol. Bikol is one of the major languages in the Philippines; however, its underrepresentation in the digital sphere is attributed to the absence of annotated data. This study outlines the development process of BFParCo, a proposed gold-standard dataset for the Bikol and Filipino parallel corpus. The corpus underwent refinement through manual phrase alignment, translation, and evaluation. Subsequently, T5 and mT5 transformer models were fine-tuned with the parallel corpus and evaluated using the Bilingual Evaluation Understudy (BLEU) metric. The results showed a notable improvement in BLEU score after fine-tuning, with an increase of 60.68 for BIK→FIL and 58.93 for FIL→BIK translations. Additionally, human evaluators comprehensively assessed the fine-tuned models' results using the Multidimensional Quality Metrics and Scalar Quality Metrics error taxonomies. The fine-tuned models were then made publicly accessible through Hugging Face. This study represents a significant stride in advancing machine translation tools for the Bikol and Filipino languages.

Keywords: Natural language processing; Language models; Transfer learning; Fine-tuning; Low-resource language; Bikol; Filipino

Classification: TP391.2 [Automation and Computer Technology / Computer Application Technology]
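The abstract describes fine-tuning T5 and mT5 on the BFParCo parallel corpus and scoring the output with corpus-level BLEU. Below is a minimal sketch of what such a pipeline can look like with the Hugging Face transformers, datasets, and sacrebleu libraries; the checkpoint size, corpus file names (bfparco_train.csv, bfparco_test.csv), column names (fil, bik), and all hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
import sacrebleu
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# The paper fine-tunes T5 and mT5; the exact variant is assumed here.
checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical file and column layout for the BFParCo parallel corpus.
raw = load_dataset(
    "csv",
    data_files={"train": "bfparco_train.csv", "test": "bfparco_test.csv"},
)

def preprocess(batch):
    # FIL -> BIK direction; swap the columns to train BIK -> FIL.
    inputs = tokenizer(batch["fil"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["bik"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt5-fil-bik",
        learning_rate=3e-4,            # illustrative, not the paper's values
        num_train_epochs=5,
        per_device_train_batch_size=8,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Corpus-level BLEU on the held-out test split, mirroring the paper's metric.
hypotheses = []
for source in raw["test"]["fil"]:
    ids = tokenizer(source, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(ids, max_length=128)
    hypotheses.append(tokenizer.decode(output[0], skip_special_tokens=True))
print("BLEU:", sacrebleu.corpus_bleu(hypotheses, [raw["test"]["bik"]]).score)
```

Since the paper states the fine-tuned models were released on Hugging Face, the same `AutoModelForSeq2SeqLM.from_pretrained(...)` call would load them directly once the published repository identifiers are known.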

 
