Comparative Study on Chinese Patent Text Classification Based on Improved SVM  Cited by: 3


Authors: YANG Chaoyu [1]; CHEN Wenjun; GENG Xianya

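Affiliations: [1] School of Artificial Intelligence, Anhui University of Science and Technology, Huainan, Anhui 232000; [2] School of Economics and Management, Anhui University of Science and Technology, Huainan, Anhui 232000; [3] School of Mathematics and Big Data, Anhui University of Science and Technology, Huainan, Anhui 232000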

Source: Journal of Wuhan University of Technology: Information & Management Engineering, 2023, No. 2, pp. 292-298, 303 (8 pages)

Funding: National Natural Science Foundation of China (61873004); National Undergraduate Innovation and Entrepreneurship Training Program (202210361115X).

Abstract: This study aims to mine the features of Chinese patent texts more deeply, so that patent categories are delineated more clearly and the technical links between patents are tighter. First, smart home patents were crawled from a patent information platform to build a smart home patent corpus, which was then segmented into words and stripped of stop words. Second, two natural language processing approaches, TF-IDF-LDA and mean Word2Vec, were each used to vectorize the corpus text, and word clouds were drawn to show the selected words most representative of the documents. Finally, an SVM was introduced for text classification, and the results of the two parallel experiments were compared to select the best model. Oversampling was applied to address the imbalanced data distribution and further improve classification accuracy. The results show that mean Word2Vec reaches an accuracy of 97.15%, compared with 86.91% for LDA, and that the mean Word2Vec model optimized by oversampling reaches 98.51%. The method offers a new approach to reclassifying Chinese patent texts, helps uncover key co-occurring technologies, and supports the integrated development of industry, universities, and research institutes in China.
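Two steps named in the abstract, mean pooling of word vectors and oversampling of under-represented classes, can be illustrated with a minimal standard-library sketch. Everything here is an assumption made for illustration: the toy embeddings, the labels, and the helper names `mean_doc_vector` and `random_oversample` are hypothetical, and the abstract does not say which oversampling method the authors used, so plain random duplication stands in for it. In the study itself the vectors would come from a trained Word2Vec model and the balanced set would be passed to an SVM classifier.

```python
import random
from collections import Counter


def mean_doc_vector(tokens, word_vectors, dim):
    """Average the embeddings of the tokens that have a known vector."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known token: fall back to the zero vector
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def random_oversample(features, labels, seed=0):
    """Duplicate random minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for label, count in counts.items():
        pool = [x for x, y in zip(features, labels) if y == label]
        for _ in range(target - count):
            out_x.append(rng.choice(pool))
            out_y.append(label)
    return out_x, out_y


# Toy 2-dimensional embeddings (hypothetical values, not from the paper).
toy_vectors = {
    "smart": [1.0, 0.0],
    "lock": [0.0, 1.0],
    "lamp": [1.0, 1.0],
}
docs = [["smart", "lock"], ["smart", "lamp"], ["lamp"]]
labels = ["security", "lighting", "lighting"]

# One fixed-length vector per document, then a class-balanced training set.
X = [mean_doc_vector(d, toy_vectors, dim=2) for d in docs]
X_bal, y_bal = random_oversample(X, labels)
```

Mean pooling gives every document a vector of the same length regardless of how many tokens it has, which is what a classifier such as an SVM needs as input. Oversampling should be applied to the training split only, so that duplicated samples do not leak into the evaluation data.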

Keywords: LDA topic model; mean Word2Vec; support vector machine; industry-university-research; Chinese patent classification

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
