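Affiliations: [1] College of Computer and Information Engineering, Hohai University, Nanjing 210098; [2] Department of Computer Science, Tonghua Normal University, Tonghua 134002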
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2008, No. 6, pp. 851-856 (6 pages)
Funding: Supported by the National Natural Science Foundation of China (Nos. 60673186 and 60571048).
Abstract: Effective text categorization depends on reducing the dimensionality of the high-dimensional feature space, which can be done by either feature selection or feature extraction. An analysis of existing feature selection methods shows that they rely only on document counts to choose features and do not take term weights into account. To discover the essential features by making full use of term weights, training samples, and classes, we propose a feature selection method based on the fuzzy relation between terms and classes, in which term weights determine the degree of membership. Using a K-Nearest Neighbor (KNN) classifier trained and tested on the Reuters-21578 standard text corpus, the proposed method achieves the highest scores among the compared methods: 81.82% on Macro-F1 and 94.88% on Micro-F1. Compared with IG and CHI, Macro-F1 improves by 4.73% and 1.12% respectively, and Micro-F1 by 1.56% and 0.21%.
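The abstract reports both Macro-F1 and Micro-F1, and the gap between the two figures is easier to read with the definitions at hand. The following is a minimal sketch of the standard definitions only, not the paper's evaluation code: the function names, the single-label simplification, and the toy labels are all assumptions made for illustration. Macro-F1 averages per-class F1 scores, so small classes count as much as large ones; Micro-F1 pools the counts over all classes first, so large classes dominate.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions.

    Assumption: each document has exactly one true and one predicted class.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # wrongly assigned to class p
            fn[t] += 1  # missed for class t
    # Macro: unweighted mean of the per-class F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: one F1 computed from counts pooled over all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: one large class and one small class.
y_true = ["earn"] * 8 + ["grain"] * 2
y_pred = ["earn"] * 8 + ["earn", "grain"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"Macro-F1 = {macro:.4f}, Micro-F1 = {micro:.4f}")
```

In this toy case a single error on the small class pulls Macro-F1 well below Micro-F1, which is the usual pattern on a corpus with skewed class sizes.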