检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
机构地区:[1]福建师范大学数学与计算机科学学院,福州350007
出 处:《计算机应用》2014年第10期2869-2873,共5页journal of Computer Applications
基 金:国家自然科学基金资助项目(61175123)
摘 要:传统的n-gram文本特征提取方法会产生高维度的特征向量,高维数据不但增大了分类的难度,同时也会增加分类的时间。针对这一问题,提出了一种基于词性(POS)标注序列的特征提取方法,根据词性序列能够代表一类文本的这一个特点,利用词性序列组作为文本的特征以达到降低特征维度的效果。在实验中,词性序列特征提取方法比n-gram特征提取方法至少提高了9%的分类精度,降低4816个维度。实验结果表明,该方法能够适用于微博情感分类。Traditional n-gram feature extraction tends to produce a high-dimensional feature vector. High-dimensional data not only increases the difficulty of classification, but also increases the classification time. Aiming at this problem, this paper presented a feature extraction method based on Part-of-Speech (POS) tagging sequences. The principle of this method was to use POS sequences as text features to reduce feature dimension, according to the property that POS sequences can represent a kind of text. In the experiment, compared with the n-gram feature extraction, the feature extraction based on POS sequences at least improved the classification accuracy of 9% and reduced the dimension of 4 816. The experimental results show that the method is suitable for emotion classification in micro blog.
关 键 词:特征提取 词性 标注序列 微博情感分类 极性分类
分 类 号:TP391.1[自动化与计算机技术—计算机应用技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.117

