A Patent Document Classification Method Based on Word Weighted LDA Model  Cited by: 5


Authors: SUN Wei (孙伟)[1], LIU Wen-jing (刘文静), GE Li-ge (葛丽阁), YU Xuan (余璇)[1]

Affiliation: [1] School of Information Engineering, Shanghai Maritime University, Shanghai 201306, China

Source: Computer Technology and Development (《计算机技术与发展》), 2019, No. 3, pp. 23-29 (7 pages)

Funding: National Natural Science Foundation of China, Young Scientists Program (61203240)

Abstract: When traditional topic models are used for text classification, the feature words they select are mostly statistically high-frequency words. In patent document classification, however, most domain-specific terms tend to be drowned out by high-frequency words, so topic models achieve low accuracy on patent documents. To address this, a supervised LDA topic model based on word weighting is proposed for patent document classification. Starting from the co-occurrence relationship between domain-specific terms and high-frequency words, the KeyGraph algorithm is used to select keywords with stronger feature representation ability, and a mutual information function is then used to compute the weight of each keyword, building a dictionary of domain-specific terms. On this basis, a supervised LDA model is built, word weighting is extended into the LDA model, and Gibbs Sampling is used for parameter estimation. In classification experiments on patent documents, compared with the LDA model and two of its variants, the proposed model improves classification accuracy by an average of 4.62%, 3.74% and 3.26%, respectively. This indicates that the highly discriminative domain-specific terms selected by the model are more strongly associated with the topics, and that both classification efficiency and accuracy improve clearly.
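
The abstract does not state which mutual information formula is used to weight keywords. As a rough illustration only, the sketch below assumes the expected mutual information between a word's presence in a document and the document's class label. The function name `mutual_information_weights` and the toy data are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def mutual_information_weights(docs, labels):
    """Return {word: MI(word presence; class label)} in nats.

    docs   : list of token lists, one per document
    labels : list of class labels, same length as docs
    Assumption: expected mutual information over presence/absence,
    which may differ from the weighting used in the paper.
    """
    n = len(docs)
    class_count = Counter(labels)

    # doc_freq[w][c] = number of documents of class c that contain word w
    doc_freq = {}
    for tokens, c in zip(docs, labels):
        for w in set(tokens):
            doc_freq.setdefault(w, Counter())[c] += 1

    weights = {}
    for w, per_class in doc_freq.items():
        n_w = sum(per_class.values())  # documents containing w
        mi = 0.0
        for c, n_c in class_count.items():
            for present in (True, False):
                joint = per_class.get(c, 0) if present else n_c - per_class.get(c, 0)
                if joint == 0:
                    continue  # a zero joint count contributes nothing to the sum
                marg_w = n_w if present else n - n_w
                mi += (joint / n) * math.log(joint * n / (marg_w * n_c))
        weights[w] = mi
    return weights


# Toy corpus: "method" occurs everywhere, "engine"/"protein" are class-specific.
toy_docs = [["method", "engine"], ["method", "engine"],
            ["method", "protein"], ["method", "protein"]]
toy_labels = ["A", "A", "B", "B"]
toy_weights = mutual_information_weights(toy_docs, toy_labels)
```

In this toy corpus the ubiquitous word "method" receives a weight of zero, while the class-specific words receive the largest possible weight for two balanced classes. This mirrors the abstract's point that frequent words are not necessarily the discriminative ones.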

Keywords: weighted model; LDA; KeyGraph algorithm; patent document classification

Classification code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

 
