Ko-LLaMA: A Korean Large Language Model Based on LLaMA


Authors: Pang Jie; Yan Xiao-dong [1,2,3]; Zhao Xiao-bing

Affiliations: [1] Minzu University of China, Beijing 100081, China; [2] National Language Resources Monitoring and Research Center for Minority Languages, Beijing 100081, China; [3] Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Beijing 100081, China

Source: Foreign Language Research (《外语学刊》), 2025(1): 1-8.

Funding: A phased result of the National Social Science Fund of China key project "Interdisciplinary Research on the Thirteen Classics in Multiple Ethnic Languages and Database Construction" (22&ZD035).

Abstract: Large language models have gained immense popularity in the last couple of years, with models like ChatGPT and GPT-4 revolutionizing natural language processing research and taking exciting steps towards artificial general intelligence (AGI). Despite several large language models being open-sourced, such as LLaMA, these models primarily focus on English and Chinese corpora, with limited applicability to other languages; for minority languages such as Korean, the applicability is even more limited. In this paper, we enhance the applicability of LLaMA to Korean by extending its existing vocabulary with an additional 20,000 Korean tokens, improving its ability to encode and semantically understand Korean. We further continue pre-training the model on Korean data, fine-tune it on a Korean instruction dataset (SFT: Supervised Fine-Tuning), and analyze the impact of varying amounts of data on the fine-tuning effect. After continued pre-training and instruction fine-tuning, the model's ability to understand and follow Korean instructions improves significantly. Experimental results show that the proposed model, Ko-LLaMA, significantly outperforms the original LLaMA in understanding and generating Korean content. Furthermore, on the Korean text classification dataset YNAT, Ko-LLaMA was compared against the CINO model, which excels at minority languages, several CINO model combinations, the original LLaMA, and GPT-3.5. The results indicate that Ko-LLaMA's Korean text classification ability far surpasses that of CINO and its combinations, as well as LLaMA and GPT-3.5, which have not undergone vocabulary expansion and continued pre-training on Korean corpora.
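The vocabulary-extension step the abstract describes (merging roughly 20,000 Korean tokens into LLaMA's SentencePiece vocabulary, then resizing the embedding matrix) is commonly implemented by splicing the two tokenizers' serialized protobuf models. The following is a minimal sketch under that assumption, not the authors' released code; the checkpoint name "llama-7b" and the file names korean_sp.model and ko_llama_sp.model are hypothetical placeholders.

```python
# Sketch of the vocabulary-extension step: merge Korean SentencePiece tokens
# into LLaMA's tokenizer, then resize the model's embeddings so the new
# Korean tokens get trainable rows. Paths/names below are placeholders.
import sentencepiece.sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer, LlamaForCausalLM

llama_tokenizer = LlamaTokenizer.from_pretrained("llama-7b")

# Parse both tokenizers' serialized SentencePiece model protos.
base = sp_pb2.ModelProto()
base.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
korean = sp_pb2.ModelProto()
korean.ParseFromString(open("korean_sp.model", "rb").read())

# Append every Korean piece that LLaMA's vocabulary does not already contain.
existing = {p.piece for p in base.pieces}
for p in korean.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto().SentencePiece()
        new_piece.piece, new_piece.score = p.piece, 0.0
        base.pieces.append(new_piece)

with open("ko_llama_sp.model", "wb") as f:
    f.write(base.SerializeToString())

# Load the merged tokenizer and grow the embedding and LM-head matrices;
# the newly added rows are randomly initialized and are learned during the
# continued pre-training stage described in the abstract.
merged_tokenizer = LlamaTokenizer(vocab_file="ko_llama_sp.model")
model = LlamaForCausalLM.from_pretrained("llama-7b")
model.resize_token_embeddings(len(merged_tokenizer))
model.save_pretrained("ko-llama-base")
merged_tokenizer.save_pretrained("ko-llama-base")
```

After the merge, the continued pre-training and SFT stages the abstract describes are standard causal-language-modeling runs over Korean corpora and Korean instruction-response pairs, respectively.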

Keywords: Korean; large language model; vocabulary expansion; continued pre-training; instruction fine-tuning

CLC Number: H08 [Language and Writing: Linguistics]

 
