Semantic-based Feature Extraction Method for Document (基于语义的文档特征提取研究方法)  Cited by: 10


Authors: JIANG Fang (姜芳)[1,2], LI Guohe (李国和)[1,2], YUE Xiang (岳翔)

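Affiliations: [1] College of Geophysics and Information Engineering, China University of Petroleum (Beijing), Beijing 102249; [2] Beijing Key Laboratory of Oil and Gas Data Mining, China University of Petroleum (Beijing), Beijing 102249; [3] Information and Data Center, CNOOC Research Institute, Beijing 100029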

Source: Computer Science (《计算机科学》), 2016, No. 2, pp. 254-258 (5 pages)

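Funding: Supported by the National High Technology Research and Development Program of China (2009AA062802); the National Natural Science Foundation of China (60473125); the CNPC Petroleum Science and Technology Innovation Fund for Young and Middle-aged Researchers (05E7013); and a sub-project of the National Major Science and Technology Project (G5800-08-ZS-WX)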

Abstract: Feature word selection for Chinese text is an important part of text processing and strongly affects text classification. Existing text feature extraction methods have several shortcomings: they produce high-dimensional feature vectors, depend on the training set, and ignore low-frequency keywords. In this paper, the semantic distance between words is calculated using the Tongyici Cilin (《同义词词林》) synonym dictionary, the theme-related words of each category are then selected with a density clustering method, and finally the feature words are chosen from the theme-related words using the information gain algorithm. One validation experiment and one comparison experiment were carried out, with the macro-F value and the micro-F value as evaluation metrics. The results show that the proposed method selects text features more effectively than other classical algorithms.
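The final step named in the abstract, choosing feature words by information gain, can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate list stands in for the theme-related words that the paper obtains through semantic-distance clustering, and the function names, toy documents, and category labels are all hypothetical. Each term is treated as a binary feature (present or absent in a document), and its score is IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)].

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def select_features(docs, labels, candidates, k):
    """Return the k candidate terms with the highest information gain."""
    ranked = sorted(
        candidates,
        key=lambda t: information_gain(docs, labels, t),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of already-segmented words.
docs = [
    {"oil", "drilling", "well"},
    {"oil", "reservoir"},
    {"match", "goal", "team"},
    {"team", "coach", "well"},
]
labels = ["energy", "energy", "sports", "sports"]

# Assumed to be the theme-related words produced by an earlier clustering step.
candidates = ["oil", "team", "well"]

top = select_features(docs, labels, candidates, 2)
print(top)
```

In this toy corpus, "oil" and "team" each separate the two categories perfectly, while "well" appears once in each category and carries no information, so it is ranked last.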

Keywords: feature words; semantic distance; information gain; text classification

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
