A retrieval-augmented generation method based on domain knowledge

Authors: ZHANG Gaofei; LI Huan; CHI Yunxian; ZHAO Qiaohong; GOU Zhinan; GAO Kai

Affiliations: [1] School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, China; [2] Hebei Vocational College of Rail Transportation, Shijiazhuang, Hebei 050801, China; [3] School of Management Science and Information Engineering, Hebei University of Economics and Business, Shijiazhuang, Hebei 050061, China

Source: Hebei Journal of Industrial Science and Technology, 2025, No. 2, pp. 103-110, 196 (9 pages)

Funding: Natural Science Foundation of Hebei Province (F2022208006, F2023207003); Science and Technology Research Project of Higher Education Institutions of Hebei Province (QN2024196).

Abstract: To improve the accuracy of current large language models (LLMs) when generating answers from retrieved documents, a retrieval-augmented generation (RAG) method based on domain knowledge was proposed. First, during retrieval, a first layer of sparse retrieval was performed using both the question and domain knowledge, producing a domain-specific candidate set for the subsequent dense retrieval. Second, during generation, a zero-shot learning approach was adopted: domain knowledge was concatenated before or after the question and combined with the retrieved documents as input to the LLM. Finally, extensive experiments and performance evaluations were conducted on medical-domain and legal-domain datasets using ChatGLM2-6B and Baichuan2-7B-chat. The results show that the proposed method effectively improves the domain relevance of the answers generated by LLMs, and that the zero-shot learning method outperforms fine-tuning. Under zero-shot learning, the combination of domain-knowledge-enhanced sparse retrieval and placing the domain knowledge before the question achieved the best improvement on ChatGLM2-6B, raising ROUGE-1, ROUGE-2, and ROUGE-L scores by 3.82, 1.68, and 4.32 percentage points, respectively, over the baseline. The proposed method improves the accuracy of LLM-generated answers and provides an important reference for research on and applications of open-domain question answering.
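The two-stage retrieval and zero-shot prompt construction described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the term-overlap scorer stands in for a real sparse retriever such as BM25, the bag-of-words cosine stands in for a dense encoder, and all function names, the domain-term list, and the prompt template are assumptions.

```python
import math
from collections import Counter

def sparse_retrieve(question, domain_terms, corpus, k=3):
    """First layer: sparse retrieval over the full corpus, with the query
    expanded by domain-knowledge terms (toy term-overlap score in place
    of BM25). Returns a domain-specific candidate set."""
    query_terms = Counter(question.split()) + Counter(domain_terms)
    scored = []
    for doc in corpus:
        doc_terms = Counter(doc.split())
        score = sum(min(query_terms[t], doc_terms[t]) for t in query_terms)
        scored.append((score, doc))
    scored.sort(key=lambda pair: -pair[0])
    return [doc for _, doc in scored[:k]]

def cosine(a, b):
    """Bag-of-words cosine similarity, standing in for dense embeddings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def dense_retrieve(question, candidates, k=1):
    """Second layer: dense retrieval restricted to the sparse candidates."""
    return sorted(candidates, key=lambda d: -cosine(question, d))[:k]

def build_prompt(question, domain_knowledge, documents, knowledge_first=True):
    """Zero-shot generation input: domain knowledge concatenated before
    (or after) the question, combined with the retrieved documents."""
    if knowledge_first:
        q = domain_knowledge + " " + question
    else:
        q = question + " " + domain_knowledge
    context = "\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {q}\nAnswer:"

# Toy usage: the resulting prompt would be fed to an LLM such as ChatGLM2-6B.
corpus = ["flu fever symptoms treatment", "contract law dispute",
          "fever medicine dosage"]
candidates = sparse_retrieve("what treats fever", ["fever", "treatment"],
                             corpus, k=2)
top_docs = dense_retrieve("what treats fever", candidates, k=1)
prompt = build_prompt("what treats fever", "Domain: internal medicine.",
                      top_docs, knowledge_first=True)
```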

Keywords: natural language processing; open-domain question answering; retrieval-augmented generation; large language model; zero-shot learning; domain knowledge

Classification code: TP391.1 (Automation and Computer Technology — Computer Application Technology)

 
