Affiliation: [1] Institute of Information Management and Information Systems, Harbin Institute of Technology, Harbin 150001
Source: Computer Engineering and Applications (《计算机工程与应用》), 2012, No. 16, pp. 21-25 (5 pages)
Funding: National Natural Science Foundation of China (No. 70801022); Doctoral Program Foundation of the Ministry of Education (No. 200802131048)
Abstract: When applied to spam filtering, the Naive Bayes classifier often suffers from the sparse data problem. Because feature occurrences in a corpus follow Zipf's law, this problem is hard to solve simply by enlarging the training corpus. This paper does three things to alleviate it. First, a data smoothing algorithm is embedded into Naive Bayes to estimate the compensation probability of unseen features. Second, domain term extraction and a concept relatedness model are introduced to strengthen the filter's handling of semantic knowledge. Third, an incremental learning method is used to carry out dynamic online learning. Open-test experiments on the Ling-Spam corpus show that the method increases precision by 2.51%. In addition, experiments on the National 863 Evaluation on Text Classification corpus show that Naive Bayes with the Good-Turing algorithm performs 3.05% better than Naive Bayes with Laplace smoothing.
Classification: TP393 [Automation and Computer Technology: Computer Application Technology]
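The abstract contrasts Laplace smoothing with Good-Turing smoothing for the probability of unseen features, but gives no formulas. The sketch below is only an illustration of that contrast, not the paper's implementation: it uses a simplified Good-Turing estimate that reserves the mass N1/N (singleton types over total tokens) for unseen words and rescales seen words proportionally, rather than computing full adjusted counts. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def laplace_probs(counts, vocab):
    """Add-one (Laplace) estimate: (c + 1) / (N + |V|) for every vocabulary word."""
    total = sum(counts.values())
    return {w: (counts.get(w, 0) + 1) / (total + len(vocab)) for w in vocab}


def good_turing_probs(counts, vocab):
    """Simplified Good-Turing (an assumption, not the paper's exact method).

    The mass N1 / N is shared equally among unseen vocabulary words;
    seen words keep relative frequencies scaled by (1 - N1 / N).
    Falls back to Laplace when that mass would be 0 or 1.
    """
    total = sum(counts.values())
    n1 = sum(1 for c in counts.values() if c == 1)
    unseen = [w for w in vocab if counts.get(w, 0) == 0]
    if not unseen or n1 == 0 or n1 == total:
        return laplace_probs(counts, vocab)
    p0 = n1 / total
    probs = {w: (1 - p0) * c / total for w, c in counts.items()}
    for w in unseen:
        probs[w] = p0 / len(unseen)
    return probs


def train(docs, estimator):
    """docs is a list of (tokens, label); returns (class priors, per-class word probabilities)."""
    vocab = {w for tokens, _ in docs for w in tokens}
    label_counts = Counter(label for _, label in docs)
    model = {}
    for label in label_counts:
        counts = Counter(w for tokens, l in docs if l == label for w in tokens)
        model[label] = estimator(counts, vocab)
    priors = {label: n / len(docs) for label, n in label_counts.items()}
    return priors, model


def classify(tokens, priors, model):
    """Pick the class with the highest log posterior; out-of-vocabulary words are skipped."""
    def score(label):
        probs = model[label]
        return math.log(priors[label]) + sum(
            math.log(probs[w]) for w in tokens if w in probs
        )
    return max(priors, key=score)


# Toy corpus, made up for illustration only.
docs = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches now".split(), "spam"),
    ("meeting agenda for monday meeting".split(), "ham"),
    ("project meeting agenda notes".split(), "ham"),
]

for estimator in (laplace_probs, good_turing_probs):
    priors, model = train(docs, estimator)
    print(estimator.__name__, classify("buy cheap pills".split(), priors, model))
```

Both estimators give every vocabulary word a nonzero probability in every class, which is what keeps a single unseen word from zeroing out the Naive Bayes product. They differ in how much mass the unseen words receive: Laplace fixes it through the vocabulary size, while the Good-Turing variant ties it to how many words were seen exactly once.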