融合Finetuned-BERTopic和大模型的技术主题识别方法研究

Research on Method for Technology Topic Identification Based on Finetuned-BERTopic and Large Language Models

作　　者：张凯杨敏纳隗玲[1] Zhang Kai;Yang Minna;Wei Ling(School of Information,Shanxi University of Finance and Economics,Shanxi Taiyuan 030006)

机构地区：[1]山西财经大学信息学院,山西太原030006

出　　处：《情报理论与实践》2025年第3期189-198,共10页Information Studies:Theory & Application

基　　金：国家自然科学基金青年项目“基于多视角科技知识图谱融合的新兴技术演化路径识别与预测方法研究”的成果,项目编号:72304176。

摘　　要：[目的/意义]文章提出一种结合科技文本预训练语言模型微调的BERTopic和大模型的技术主题识别方法,深入学习科技文本内容中蕴含的语义特征,从非结构化的科技文本中识别技术主题,并对其进行自动解读以归纳生成主题标签,减少人工干预,进一步提升技术主题识别的确度与效度,为扩展和丰富技术主题识别研究方法体系提供理论与工具支持。[方法/过程]采用PAT SPECTER预训练语言模型对科技文本进行向量化表征,结合KeyBERT构建Finetuned-BERTopic模型,建模技术词汇间的语义关联关系,抽取特定领域的技术术语,以技术术语为表征单位对科技文本中蕴含的技术主题进行识别;使用GPT-4o大模型和提示工程对上述识别的技术主题内容进行自动评价并解读生成主题标签;在此基础上,以生成式人工智能领域为例,验证本文方法的有效性。[结果/结论]实验验证表明,对比LDA主题模型、Top2Vec、BERTopic等模型,文章提出的方法有效提高了技术主题识别的准确性,且可显著减少人工干预,实现更高效的技术主题发现。[Purpose/significance]This paper proposes a technology topic identification method that combines BERTopic model finetuned with scientific literature domain-specific Pre-trained Language Model and Large Language Model.The proposed method dives into learning the semantic features contained in the content of scientific literature,identifies technology topics from unstructured scientific texts,evaluates and interprets the content of identified technology topics to generate topic labels automatically.By reducing manual intervention and further improving the accuracy and validity of technology topic identification,the proposed method can provide theoretical and tool support for expanding and enriching the research method system of technology topic identification.[Method/process]The paper proposed a novel method framework for generating technology topic representations which consists of three core steps.First,the Finetuned-BERTopic model was built to identify the technology topics hidden in the scientific literature.In the Finetuned-BERTopic,the scientific literature document was converted to its embedding representation using a scientific domain-specific pre-trained language model named PAT SPECTER and the semantic relationship between technology tokens were modeled using KeyBERT to generate the complete technical phrases in topic representations.Then,the identified technology topic contents were evaluated and interpreted to generate topic labels automatically using a Large Language Model named GPT-4o and prompt engineering.Lastly,based on this,take the domain of generative artificial intelligence as an example to verify the effectiveness of the method in this paper.[Result/conclusion]Experimental results show that the method proposed in this paper effectively improves the accuracy of technology topic identification and significantly reduces manual intervention,achieving more efficient technology topic extraction across a variety of benchmarks involving LDA,Top2Vec,BERTopic and other models.

关键词：科技文本技术主题识别微调的BERTopic 大语言模型生成式人工智能

分类号：TP18[自动化与计算机技术—控制理论与控制工程] TP391.1[自动化与计算机技术—控制科学与工程]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

融合Finetuned-BERTopic和大模型的技术主题识别方法研究

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

融合Finetuned-BERTopic和大模型的技术主题识别方法研究

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索