Authors: Zhang Kai (张凯); Yang Minna (杨敏纳); Wei Ling (隗玲) (School of Information, Shanxi University of Finance and Economics, Taiyuan, Shanxi 030006)
Source: Information Studies: Theory & Application (《情报理论与实践》), 2025, No. 3, pp. 189-198 (10 pages)
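Funding: An output of the National Natural Science Foundation of China Youth Program project "Research on Methods for Identifying and Predicting Emerging Technology Evolution Paths Based on Multi-View Scientific and Technological Knowledge Graph Fusion" (Grant No. 72304176).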
Abstract: [Purpose/significance] This paper proposes a technology topic identification method that combines BERTopic, fine-tuned with a pre-trained language model for scientific and technical text, with a large language model. The method learns the semantic features contained in scientific and technical text, identifies technology topics from unstructured text, and automatically interprets them to generate topic labels. By reducing manual intervention and further improving the accuracy and validity of technology topic identification, it offers theoretical and tool support for extending the methodology of technology topic identification research.
[Method/process] The framework has three steps. First, a Finetuned-BERTopic model is built: the PAT SPECTER pre-trained language model converts each document into an embedding, and KeyBERT models the semantic relationships between technical tokens so that domain-specific technical terms are extracted as complete phrases and used as the units for representing technology topics. Second, the GPT-4o large language model and prompt engineering are used to automatically evaluate and interpret the identified topics and generate topic labels. Finally, the method is validated on the domain of generative artificial intelligence.
[Result/conclusion] Experiments show that, compared with LDA, Top2Vec, BERTopic, and other models, the proposed method improves the accuracy of technology topic identification and significantly reduces manual intervention, enabling more efficient technology topic discovery.
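Keywords: scientific and technical text; technology topic identification; fine-tuned BERTopic; large language model; generative artificial intelligence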
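Illustration: the abstract states that GPT-4o and prompt engineering are used to evaluate each identified topic and generate its label, but it does not give the prompt. The following is a minimal sketch of how such a prompt might be assembled, not the authors' implementation. The function name `build_topic_label_prompt`, the prompt wording, the 1-to-5 coherence scale, and the example terms are all assumptions made for illustration.

```python
from typing import Sequence


def build_topic_label_prompt(
    terms: Sequence[str],
    sample_docs: Sequence[str],
    domain: str = "generative artificial intelligence",
    max_docs: int = 3,
) -> str:
    """Assemble a labeling prompt for one topic (hypothetical wording)."""
    # Take the first few candidate documents, then drop any that are blank,
    # so the prompt stays short.
    docs = [d.strip() for d in sample_docs[:max_docs] if d.strip()]
    # Join the topic's technical terms into a single comma-separated line.
    term_line = ", ".join(t.strip() for t in terms if t.strip())
    doc_lines = "\n".join(f"- {d}" for d in docs)
    return (
        f"You are an analyst of {domain} technologies.\n"
        f"Topic terms: {term_line}\n"
        f"Representative texts:\n{doc_lines}\n"
        "Task: rate the coherence of this topic from 1 to 5, "
        "then give a concise topic label of at most six words."
    )


if __name__ == "__main__":
    # Made-up example terms and text, not data from the paper.
    prompt = build_topic_label_prompt(
        ["diffusion model", "image synthesis", "latent space"],
        ["A method for generating images with a latent diffusion model."],
    )
    print(prompt)  # this string would then be sent to a large language model
```

Keeping prompt construction in a plain function separates it from the model call, so the same topic terms can be re-evaluated with a different prompt or model without rerunning topic extraction.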