Authors: FANG Qiulian [1], WANG Peijin, SUI Yang, ZHENG Hanying, LV Chunyue, WANG Yantong (School of Mathematics and Statistics, Central South University, Changsha 410083, China)
Affiliation: [1] School of Mathematics and Statistics, Central South University
Source: Journal of Jilin University (Science Edition), 2019, No. 6, pp. 1479-1484 (6 pages)
Funding: Statistical Research Project of Hunan Province (Grant No. 2018A01); National Undergraduate Innovation and Entrepreneurship Program (Grant No. S20190533497)
Abstract: A naive Bayes algorithm is used to build an automatic Chinese text classifier, and the selection of the relevant parameters is studied in order to classify Chinese text efficiently. First, in the model training stage, an N-gram model is used to process the training data set and extract feature vectors. Then, the naive Bayes algorithm is used to build the text classifier. Finally, in the model testing stage, the term frequency-inverse document frequency (TF-IDF) algorithm is used to extract feature vectors from the test samples in order to improve classification accuracy. The experimental results show that, when extracting feature vectors from the training set, the 2-gram and 4-gram models give the best feature extraction; when choosing the feature vector length, a length of 25000 yields the largest gain in classification accuracy while keeping accuracy high; and when choosing the part of speech of the feature terms, selecting both verbs and nouns gives the highest classifier accuracy, while selecting only verbs gives the lowest.
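Keywords: naive Bayes classifier; feature selection; TF-IDF algorithm; N-gram model
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]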
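
The training steps named in the abstract (N-gram feature extraction followed by a naive Bayes classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the character-level 2-grams, the add-one (Laplace) smoothing, the class name `NgramNaiveBayes`, and the four toy training sentences are all assumptions made for illustration, and the TF-IDF step applied to test samples in the paper is omitted.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-gram counts (hypothetical helper)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n          # n-gram order (assumed: 2)
        self.alpha = alpha  # additive smoothing constant (assumed: 1.0)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.gram_counts[label].update(char_ngrams(doc, self.n))
        # Vocabulary: every n-gram seen in any training document.
        self.vocab = {g for counts in self.gram_counts.values() for g in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_grams = sum(self.gram_counts[label].values())
            for gram in char_ngrams(doc, self.n):
                if gram not in self.vocab:
                    continue  # n-grams never seen in training are skipped
                count = self.gram_counts[label][gram]
                score += math.log(
                    (count + self.alpha) / (total_grams + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch only.
train_docs = ["球队赢得比赛", "球员进球得分", "股票市场上涨", "银行利率下降"]
train_labels = ["sports", "sports", "finance", "finance"]

clf = NgramNaiveBayes(n=2).fit(train_docs, train_labels)
print(clf.predict("球队比赛进球"))
print(clf.predict("银行股票"))
```

Working in log space avoids numerical underflow when many n-gram probabilities are multiplied. The n-gram order and the size of the retained vocabulary correspond to the parameters whose choice the abstract reports studying; here the vocabulary is simply every n-gram seen in training.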