A Benchmark for Evaluating Crisis Information Generation Capabilities in LLMs (大语言模型的应急情报生成能力测评基准)

Authors: Han Ruilian; An Lu [1,2,3]; Zhou Wei

Affiliations: [1] Center for Studies of Information Resources, Wuhan University, Wuhan, Hubei 430072; [2] School of Information Management, Wuhan University, Wuhan, Hubei 430072; [3] Institute of Data Intelligence, Wuhan University, Wuhan, Hubei 430072

Source: Information Studies: Theory & Application (《情报理论与实践》), 2025, No. 4, pp. 54-63, 43 (11 pages in total)

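Funding: An outcome of the National Social Science Fund of China Major Project "Research on Intelligent Information Support and Decision-Making for a Resilient Society under Uncertain Environments" (Project No. 23&ZD230).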

Abstract: [Purpose/significance] In recent years, large language models (LLMs) have attracted wide attention for their powerful natural language processing capabilities, offering a new technical option for intelligent decision generation in crisis informatics. To assess the potential of LLMs in this field, this study proposes and constructs a comprehensive evaluation benchmark, CIEval, for evaluating their crisis information generation capabilities in a scientific and sound manner. [Method/process] GPT-4.0 was used to automatically construct the CIEval evaluation dataset, which covers 26 crisis scenarios across natural disasters, accident disasters, public health events, and social security events. Eight LLMs with Chinese processing capabilities, from China and abroad, were selected for evaluation. Multidimensional criteria were established for the generated information, covering content quality, expression quality, feasibility, and utility, and each model was assessed by combining manual and machine scoring. [Result/conclusion] Claude 3.5 Sonnet performs best in crisis information generation tasks; especially for complex and variable natural disasters and accident disasters, the information it generates is more comprehensive and highly practical. Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0 score slightly lower overall than the top international models but still perform strongly in specific crisis scenarios. Emergency departments can therefore select LLMs suited to the specific crisis scenario to assist information generation, improving the precision and efficiency of emergency response.

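Keywords: large language models; crisis information; large model evaluation; information generation capability; evaluation benchmark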

Classification number: TP3 [Automation and Computer Technology: Computer Science and Technology]
