A Locally Instance Weighting Naive Bayes Text Classification Algorithm Based on Distance Correlation Coefficient


Authors: Luo Jieqin (骆洁琴), Peng Ping (彭萍), Hu Guikai (胡桂开)

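Affiliations: [1] School of Science, East China University of Technology, Nanchang, Jiangxi; [2] School of Economics and Management, East China University of Technology, Nanchang, Jiangxi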

Source: Advances in Applied Mathematics (《应用数学进展》), 2024, No. 6, pp. 2901-2911 (11 pages)

Abstract: The naive Bayes algorithm is simple and efficient, and is widely used in text classification. It assumes that attributes are conditionally independent given the class, an assumption that rarely holds in practice. In addition, with the arrival of the big data era, text data increasingly exhibit nonlinear structure, which classical naive Bayes algorithms fit poorly. To address these problems, this paper proposes a locally instance-weighted naive Bayes classification algorithm based on the distance correlation coefficient. First, the distance correlation coefficient between each attribute and the class is computed and embedded as an attribute weight in the document distance measure, yielding a new distance metric. Second, the distances between the training samples and the test sample are measured and used for instance selection and instance weighting, from which a locally instance-weighted Bayesian text classifier is built. Finally, the algorithm is compared experimentally on 15 text datasets from the WEKA platform. The results show that the proposed algorithm achieves higher classification accuracy than three classical naive Bayes text classifiers.
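The abstract names the steps but not the formulas, so the following is only a minimal sketch of the first two stages under stated assumptions, not the authors' implementation. `distance_correlation` uses the standard sample distance correlation with a 0/1 distance on class labels. The weighted Euclidean form of the document distance, the neighbourhood size `k`, and the `1 / (1 + d)` instance weight are hypothetical choices made for illustration, as are all function names.

```python
import numpy as np


def _double_center(d):
    """Subtract row and column means from a distance matrix and add back the grand mean."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, labels):
    """Sample distance correlation between a numeric attribute and class labels.

    Labels are compared with a 0/1 distance (assumption: classes are unordered).
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    a = _double_center(np.abs(x[:, None] - x[None, :]))
    b = _double_center((labels[:, None] != labels[None, :]).astype(float))
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0  # a constant attribute (or a single class) carries no dependence
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def attribute_weights(X, y):
    """One weight per attribute: its distance correlation with the class."""
    X = np.asarray(X, dtype=float)
    return np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])


def local_instance_weights(X_train, x_test, attr_w, k):
    """Select the k nearest training documents and weight them by closeness.

    Hypothetical choices: weighted Euclidean distance and a 1 / (1 + d) weight.
    """
    X_train = np.asarray(X_train, dtype=float)
    diff = X_train - np.asarray(x_test, dtype=float)
    d = np.sqrt((attr_w * diff ** 2).sum(axis=1))
    idx = np.argsort(d, kind="stable")[:k]
    return idx, 1.0 / (1.0 + d[idx])
```

The selected indices and weights could then be used to train a naive Bayes model on the local neighbourhood only, for example through the `sample_weight` argument of scikit-learn's `MultinomialNB.fit`.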

Keywords: text classification; naive Bayes; instance selection; instance weighting; distance correlation coefficient

Classification code: TP3 [Automation and Computer Technology: Computer Science and Technology]

 
