A Method for Extracting Vulnerable Entities in Small Sample Semantic Analysis (小样本语义分析的漏洞实体抽取方法)

Authors: Ding Quan; Zhang Lei; Huang Shuai; Zha Zhengpeng; Tao Tao

Affiliations: [1] Electric Power Science Research Institute, State Grid Anhui Electric Power Co., Ltd., Hefei 230601; [2] School of Information Science and Technology, University of Science and Technology of China, Hefei 230026; [3] Institute of Advanced Technology, University of Science and Technology of China, Hefei 230031; [4] School of Computer Science and Technology, Anhui University of Technology, Ma'anshan, Anhui 243032

Source: Journal of Information Security Research (《信息安全研究》), 2025, No. 3, pp. 265-274 (10 pages)

Funding: Anhui Provincial University Collaborative Innovation Project (GXXT-2023-021).

Abstract: Information security vulnerability databases currently follow different standards, emphasize different aspects of vulnerability data, and are largely independent of one another, which makes it difficult to obtain high-value vulnerability information quickly and comprehensively. A unified vulnerability entity standard is therefore needed, and this paper focuses on entity extraction from vulnerability data. Most vulnerability data is presented as unstructured natural language that mixes Chinese and English. Rule-based methods generalize poorly, while deep-learning-based methods consume substantial resources and depend on large amounts of annotated data. To address these problems, the paper proposes a vulnerability entity extraction method based on small-sample semantic analysis. The method pre-trains BERT (bidirectional encoder representations from transformers) on vulnerability description data to obtain a pre-trained model for the vulnerability domain, so that the model understands vulnerability data better and depends less on large annotated datasets. In addition, a self-supervised incremental learning approach is used to improve model performance when annotated data is very limited (1785 annotated samples). The model extracts 12 types of vulnerability entities. Experimental results show that the method outperforms other extraction models on vulnerability entity extraction, reaching an F1 value of 0.8643, which indicates strong overall recognition performance and accurate extraction of vulnerability entities.

Keywords: small sample; semantic analysis; vulnerability entity extraction; BERT; CRF

Classification number: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]
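
For readers unfamiliar with how an F1 value such as the one reported in the abstract is typically obtained for entity extraction, the following is a minimal sketch of entity-level precision, recall, and F1 over BIO-tagged sequences. It assumes exact-match scoring of (span, type) pairs, which the abstract does not specify, and the tag names `PRODUCT` and `VERSION` are hypothetical placeholders rather than the paper's 12 entity types.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An I- tag with no matching open span is ignored (strict reading).
    return spans


def entity_f1(gold: List[List[str]],
              pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 with exact span matching."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)      # spans that match in boundaries and type
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical example: two gold entities, one of them predicted.
gold = [["B-PRODUCT", "I-PRODUCT", "O", "B-VERSION"]]
pred = [["B-PRODUCT", "I-PRODUCT", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
```

In this toy case the single predicted entity is correct (precision 1.0) but one gold entity is missed (recall 0.5), so F1 is 2/3. A lenient variant that gives credit for partial overlaps would score differently; the strict form is shown only because it is the most common convention.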
