Efficient pipeline for retrieval-augmented generation system under big data

Authors: Runjie YU; Yufan YANG; Jian ZHOU; Fei WU [1,2]

Affiliations: [1] Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China; [2] School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

Source: Scientia Sinica (Informationis), 2025, No. 3, pp. 542-558 (17 pages)

Funding: Supported by the National Key Research and Development Program of China (Grant No. 2022YFB4501100).

Abstract: Retrieval-augmented generation (RAG) integrates external knowledge through retrieval techniques such as approximate nearest neighbor search (ANNS), significantly improving the generation quality of large language models (LLMs). However, as external knowledge bases continue to grow, the storage requirements of ANNS indexes surge, making it impractical to keep massive datasets in memory. This has promoted the development and adoption of disk-based ANNS, but it also significantly increases the response time of RAG systems. To address this issue, this paper proposes PipeRAG, a system that pipelines disk-based ANNS retrieval with the LLM prefill process, effectively overlapping the latency of knowledge retrieval and model inference and thereby improving the overall performance of RAG systems while preserving retrieval accuracy. Specifically, PipeRAG features two core designs: the "ANNS adaptive prefetching mechanism" and the "RAG dynamic pipeline scheduling strategy". The former adjusts the prefetching speed in real time based on the current retrieval status, seeking the best balance between performance and accuracy. The latter dynamically adjusts the size of prefill tasks by considering both the ANNS prefetching speed and the LLM chunked-prefill latency, so as to achieve optimal pipeline efficiency. Extensive evaluations under real-world workloads show that PipeRAG reduces the response latency of RAG systems using disk-based ANNS by 25% to 71%, while maintaining extremely low recall loss.
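
The latency overlap described in the abstract can be illustrated with a toy two-stage pipeline model. The sketch below is not PipeRAG's implementation: it assumes retrieval and prefill are split into the same number of chunks, uses made-up per-chunk costs in milliseconds, and the function names `sequential_latency` and `pipelined_latency` are hypothetical.

```python
from itertools import accumulate


def sequential_latency(retrieve_ms, prefill_ms):
    """Baseline: finish all retrieval first, then run all prefill."""
    return sum(retrieve_ms) + sum(prefill_ms)


def pipelined_latency(retrieve_ms, prefill_ms):
    """Two-stage pipeline: chunk i is prefilled as soon as it has been
    retrieved AND the prefill of chunk i-1 has finished."""
    prefill_done = 0
    # accumulate() yields the time at which each chunk finishes retrieval.
    for retrieved_at, cost in zip(accumulate(retrieve_ms), prefill_ms):
        prefill_done = max(prefill_done, retrieved_at) + cost
    return prefill_done


# Illustrative (assumed) per-chunk costs, not measurements from the paper.
retrieve = [40, 40, 40, 40]
prefill = [30, 30, 30, 30]

seq = sequential_latency(retrieve, prefill)
pipe = pipelined_latency(retrieve, prefill)
print(f"sequential: {seq} ms, pipelined: {pipe} ms")
```

With these assumed costs the sequential path takes 280 ms and the pipelined path 190 ms, because only the last chunk's prefill is left exposed after retrieval ends. The numbers are illustrative only and are unrelated to the 25% to 71% reduction reported in the abstract. In this toy model, the gain depends on how closely per-chunk prefill cost matches per-chunk retrieval cost.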

Keywords: retrieval-augmented generation (RAG); approximate nearest neighbor search (ANNS); large language model (LLM)

Classification No.: TP3 [Automation and Computer Technology: Computer Science and Technology]

 
