Authors: PENG Peng (彭鹏); XU HongJiao (徐红姣)[1] (Institute of Scientific and Technical Information of China, Beijing 100038, P.R. China)
Source: Digital Library Forum (《数字图书馆论坛》), 2025, No. 2, pp. 65-72 (8 pages)
Abstract: With the explosive growth of online information, identifying valuable science and technology intelligence in massive volumes of web text and classifying it intelligently has become key to open-source technology intelligence analysis. Based on the characteristics of open-source technology intelligence texts, this paper constructs an integrated model for intelligent text denoising and classification. An automatic annotation method that combines large language models with prompt engineering is used to label both the noise data and the text classification data. A pre-trained language model then performs noise recognition and filtering, removing texts that are not technology intelligence. Multilingual pre-trained models and distillation techniques are applied, together with an improved loss function design, to address uneven class distribution and insufficient data, with the goal of improving, to a certain extent, the accuracy and stability of multi-label technology intelligence text classification. Experimental results show that, compared with the TextCNN and BERT methods, the proposed method achieves higher classification performance and better robustness and adaptability.
Keywords: open-source technology intelligence; text classification; information filtering; pre-trained language model
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]