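Affiliations: [1] College of Electronic and Information Engineering, Hebei University, Baoding 071000, Hebei, China; [2] College of Mathematics and Computer Science, Hebei University, Baoding 071000, Hebei, China
Source: Computer Applications and Software (《计算机应用与软件》), 2016, No. 3, pp. 244-247, 263 (5 pages)
Funding: National Natural Science Foundation of China (60903089); Hebei University Doctoral Project (Y2009157)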
Abstract: In text feature selection, the word probability space differs from the word-sense probability space, so topic features based entirely on word probabilities often fail to express the ideas of a document well and are not conducive to text classification. To make topic features better reflect the ideas of a document, a topic feature selection algorithm based on word-sense dimension reduction is proposed. The algorithm builds a "synonym table" on top of Cilin (词林) as the mapping matrix from words to word senses, constructs a probability distribution over word senses, and extracts text features with LDA for classification; classification accuracy improves markedly. Experiments show that the topic model built this way represents topics more strongly, and that the algorithm largely resolves the discrepancy between word probability and word-sense probability in text feature extraction.
Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]
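
The word-to-sense mapping step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the synonym table entries, the function names `to_sense_counts` and `build_sense_matrix`, and the choice to keep out-of-table words as their own sense are all assumptions made for illustration. The sketch only shows how word counts collapse into a lower-dimensional sense-count matrix; that matrix is what would then be passed to an LDA implementation.

```python
from collections import Counter

# Hypothetical synonym table: each word maps to a sense (synonym-group) label.
# The abstract builds such a table from a thesaurus; the entries below are
# invented purely for illustration.
SYNONYM_TABLE = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "truck": "VEHICLE",
    "happy": "JOY",
    "glad": "JOY",
    "joyful": "JOY",
}


def to_sense_counts(tokens, table=SYNONYM_TABLE):
    """Collapse the word counts of one document into sense counts.

    Words missing from the table are kept as their own sense. This is an
    assumption of the sketch, not something the abstract specifies.
    """
    return Counter(table.get(tok, tok) for tok in tokens)


def build_sense_matrix(docs, table=SYNONYM_TABLE):
    """Return (sorted sense vocabulary, document-by-sense count matrix)."""
    counts = [to_sense_counts(doc, table) for doc in docs]
    vocab = sorted(set().union(*counts)) if counts else []
    matrix = [[c.get(sense, 0) for sense in vocab] for c in counts]
    return vocab, matrix


# Two toy documents with 7 distinct words between them.
docs = [
    ["car", "automobile", "happy"],
    ["truck", "glad", "joyful", "road"],
]
vocab, matrix = build_sense_matrix(docs)
print(vocab)   # sense vocabulary (3 columns instead of 7)
print(matrix)  # per-document sense counts
```

In this toy input, seven distinct words reduce to three sense columns, so documents that use different synonyms for the same concept end up with overlapping features.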