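Affiliation: [1] Department of Information Technology, Hunan Polytechnic of Environment and Biology (湖南环境生物职业技术学院), Hengyang, Hunan 421005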
Source: Computer Technology and Development (《计算机技术与发展》), 2014, No. 9, pp. 128-132 (5 pages)
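Funding: Hunan Provincial Education Science and Technology Program project (07D036); joint funding project of the Hunan Provincial Department of Education and Department of Finance (12C1056)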
Abstract: TF-IDF is a widely used term-weighting method in text classification. However, it considers only how often a feature term occurs in a document and how frequently that term appears across the training set; it ignores how the term is distributed among classes and carries no semantic information about the term. To address these shortcomings, this paper proposes an improved TF-IDF algorithm that accounts for both the distribution of a feature term within a class and semantic factors such as the term's position and length, and therefore better reflects the term's importance. The algorithm's effectiveness is verified with a Naive Bayes classifier; experimental results show that it outperforms standard TF-IDF and improves text classification accuracy.
Classification: TP301 [Automation and Computer Technology: Computer System Architecture]
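
For reference, the classical TF-IDF weighting that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's improved algorithm: the record gives no formula for that, so none is reproduced here. The function name `tf_idf`, the length-normalized term frequency, the unsmoothed logarithmic IDF, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = log(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["text", "classification", "weight"],
    ["text", "feature", "feature"],
    ["naive", "bayes", "classification"],
]
weights = tf_idf(docs)
```

In the second document, "feature" occurs twice and appears in no other document, so it is weighted above "text", which occurs in two of the three documents. Nothing in this weighting looks at class labels, term position, or term length, which are the gaps the abstract says the improved algorithm targets.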