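Affiliations: [1] School of Computer Science and Technology, Anhui University, Hefei 230601; [2] Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Hefei 230601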
Source: Computer Engineering and Applications (《计算机工程与应用》), 2016, No. 13, pp. 95-100 (6 pages)
Funding: Key Natural Science Research Project of Anhui Provincial Universities (No. KJ2013A020); Natural Science Foundation of Anhui Province (No. 11040606M133)
Abstract: Short texts carry little content, are loosely formatted, vary in sentence length, and have relatively complex sentence structure. They therefore yield few features, and traditional text classification algorithms perform poorly on them. This paper proposes a combined feature extraction method for short text classification that fuses BTM topic features with an improved feature weight calculation. First, starting from TF-IWF, the method lowers the weight given to term frequency and introduces word distribution entropy, which yields a new weighting algorithm. Second, it uses the topic words under each topic of the BTM topic model to supplement documents that contain few words, and it takes each document's probability distribution over the topics as a further set of document features. Several groups of classification experiments with the KNN algorithm show that F1 improves by about 10% compared with features computed by traditional methods such as TF-IWF, which confirms the effectiveness of the method.
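The weighting idea in the abstract can be sketched roughly as follows. This record does not give the paper's actual formula, so everything here is an assumption made for illustration: the log damping of term frequency, the use of class-level entropy as the "word distribution entropy", the function names `term_weights` and `combine`, and the document-topic vector, which is taken as an input because BTM inference itself is out of scope.

```python
import math
from collections import Counter, defaultdict


def term_weights(docs, labels):
    """Return one {term: weight} dict per document.

    docs   : list of token lists
    labels : list of class labels, one per document
    Hypothetical weighting: log-damped tf * IWF * entropy factor.
    """
    classes = sorted(set(labels))
    corpus_counts = Counter()            # occurrences of each term in the corpus
    class_counts = defaultdict(Counter)  # term -> occurrences per class
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            corpus_counts[tok] += 1
            class_counts[tok][label] += 1
    total_tokens = sum(corpus_counts.values())

    def iwf(term):
        # Inverse word frequency: terms that are rarer in the corpus score higher.
        return math.log(total_tokens / corpus_counts[term])

    def entropy_factor(term):
        # 1 minus the normalised entropy of the term's distribution over classes:
        # 1.0 = concentrated in one class, 0.0 = spread evenly over all classes.
        if len(classes) < 2:
            return 1.0
        n = corpus_counts[term]
        h = -sum((c / n) * math.log(c / n) for c in class_counts[term].values())
        return 1.0 - h / math.log(len(classes))

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({
            # log(1 + tf) damps the influence of raw term frequency (assumption)
            t: math.log(1 + f) * iwf(t) * entropy_factor(t)
            for t, f in tf.items()
        })
    return weighted


def combine(weights, vocab, topic_dist):
    """Concatenate the term-weight vector with a document-topic distribution."""
    return [weights.get(t, 0.0) for t in vocab] + list(topic_dist)
```

The combined vectors could then be passed to any KNN classifier. Under these assumptions, a term that occurs equally often in every class receives weight zero, while a term confined to a single class keeps its full damped TF-IWF weight.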
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]