Chinese spelling correction method based on LLM with multiple inputs

Authors: MA Can, HUANG Ruizhang, REN Lina[1,2,3], BAI Ruina[1,2], WU Yaoyao

Affiliations: [1] Engineering Research Center of Ministry of Education for Text Computing and Cognitive Intelligence (Guizhou University), Guiyang, Guizhou 550025, China; [2] College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China; [3] Department of Information Engineering, Guizhou Light Industry Technical College, Guiyang, Guizhou 550025, China

Source: Journal of Computer Applications, 2025, No. 3, pp. 849-855 (7 pages)

Funding: National Natural Science Foundation of China (62066007); Guizhou Provincial Science and Technology Support Program (2022277).

摘  要:中文拼写纠错(CSC)是自然语言处理(NLP)中的一项重要研究任务,现有的基于大语言模型(LLM)的CSC方法由于LLM的生成机制,会生成和原文存在语义偏差的纠错结果。因此,提出基于LLM的多输入CSC方法。该方法包含多输入候选集合构建和LLM纠错两阶段:第一阶段将多个小模型的纠错结果构建为多输入候选集合;第二阶段使用LoRA(Low-Rank Adaptation)对LLM进行微调,即借助LLM的推理能力,在多输入候选集合中预测出没有拼写错误的句子作为最终的纠错结果。在公开数据集SIGHAN13、SIGHAN14、SIGHAN15和修正后的SIGHAN15上的实验结果表明,相较于使用LLM直接生成纠错结果的方法Prompt-GEN-1,所提方法的纠错F1值分别提升了9.6、24.9、27.9和34.2个百分点,相较于表现次优的纠错小模型,所提方法的纠错F1值分别提升了1.0、1.1、0.4和2.4个百分点,验证了所提方法能提升CSC任务的效果。Chinese Spelling Correction(CSC)is an important research task in Natural Language Processing(NLP).The existing CSC methods based on Large Language Models(LLMs)may generate semantic discrepancies between the corrected results and the original content.Therefore,a CSC method based on LLM with multiple inputs was proposed.The method consists of two stages:multi-input candidate set construction and LLM correction.In the first stage,a multi-input candidate set was constructed using error correction results of several small models.In the second stage,LoRA(Low-Rank Adaptation)was employed to fine-tune the LLM,which means that with the aid of reasoning capabilities of the LLM,sentences without spelling errors were deduced from the multi-input candidate set and used as the final error correction results.Experimental results on the public datasets SIGHAN13,SIGHAN14,SIGHAN15 and revised SIGHAN15 show that the proposed method has the correction F1 value improved by 9.6,24.9,27.9,and 34.2 percentage points,respectively,compared to the method Prompt-GEN-1,which generates error correction results directly using an LLM.Compared with the sub-optimal error correction small model,the proposed method has the correction F1 value improved by 1.0,1.1,0.4,and 2.4 percentage points,respectively,verifying the proposed method’s ability to enhance the effect of CSC tasks.

Keywords: Chinese spelling correction; large language model; model ensemble; model fine-tuning; prompt learning

CLC number: TP391.1 (Automation and Computer Technology - Computer Application Technology)

 
