Authors: WooHyun Park, Nawab Muhammad Faseeh Qureshi, Dong Ryeol Shin
Affiliations: [1] Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, 16419, Korea; [2] Department of Computer Education, Sungkyunkwan University, Seoul, 03063, Korea
Source: Computers, Materials & Continua (English-language journal), 2022, Issue 4, pp. 517-535 (19 pages)
Abstract: Spam mail classification is considered a complex and error-prone task in the distributed computing environment. Various spam mail classification approaches are available, such as the naive Bayesian classifier, logistic regression, support vector machine, decision tree, recursive neural network, and long short-term memory algorithms. However, they do not consider the document when analyzing spam mail content. These approaches use the bag-of-words method, which analyzes a large amount of text data and classifies features with the help of term frequency-inverse document frequency. Because a document contains many words, these approaches consume a massive amount of resources and become infeasible when classifying multiple associated mail documents together. As a result, spam mail is not fully classified, and these approaches are left with loopholes. We therefore propose a term frequency topic-inverse document frequency model that considers the meaning of text data in a larger semantic unit by applying weights based on the document's topic. Moreover, the proposed approach reduces the scarcity problem through a frequency topic-inverse document frequency in singular value decomposition model. The proposed approach also reduces the dimensionality, which ultimately increases the strength of document classification. Experimental evaluations show that the proposed approach classifies spam mail documents with higher accuracy using individual document-independent processing computation. Comparative evaluations show that the proposed approach performs better than the logistic regression model in the distributed computing environment, with higher document word frequencies of 97.05%, 99.17%, and 96.59%.
Keywords: NLP; big data; machine learning; TFT-IDF; spam mail
Classification No.: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
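For readers unfamiliar with the baseline weighting the abstract refers to, the following is a minimal sketch of plain term frequency-inverse document frequency over a bag of words. It is not the authors' TFT-IDF model, which additionally weights by document topic and is not specified in this record. The function name `tf_idf`, the whitespace tokenizer, the unsmoothed logarithm, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using plain TF-IDF.

    Assumptions (not from the paper): lowercase whitespace tokenization,
    tf = count / document length, idf = ln(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, invented for this example only.
mails = [
    "win a free prize now",
    "meeting agenda for monday",
    "free prize inside claim now",
]
scores = tf_idf(mails)
```

In this sketch a term that appears in every document receives weight zero, and a term unique to one document receives the largest weight for its frequency. Each unique term becomes its own feature, which illustrates why the vocabulary size, and hence the resource cost the abstract mentions, grows with the corpus.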