Text feature word extraction method based on word association  (Cited by: 10)

Text feature word selection based on relationship between words

Authors: Liao Hao (廖浩)[1], Li Zhishu (李志蜀)[1], Wang Qiuye (王秋野)[1], Zhang Yi (张意)[1]

Affiliation: [1] College of Computer Science, Sichuan University, Chengdu 610064, China

Source: Journal of Computer Applications (《计算机应用》), 2007, No. 12, pp. 3009-3012 (4 pages)

Abstract: Describing text features is one of the fundamental tasks in automatic text processing. Text features are currently described mostly with the weighted Vector Space Model (VSM), which generally relies on statistical and empirical weighting algorithms: the weight of each feature dimension is its TF-IDF value. This approach has difficulty highlighting the features that play a key role in the content of a text, and it does not reveal the relationships between words well. To address this shortcoming, a new feature selection and weight calculation method based on keywords and word co-occurrence frequency is proposed. Building on TF-IDF, the method uses the structural information of the text and applies mutual information theory to extract the words that play a key role in its content. The weight calculation integrates word position, word relationships, word frequency, and other information; this emphasizes the contribution of the keywords in the text, compensates for some deficiencies of computing weights with the TF-IDF function alone, and makes the text's feature vector carry information about the relatedness between words. In experiments with a KNN classifier, the method shows a clear improvement in average classification accuracy over the traditional TF-IDF method.
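
The abstract names two ingredients, TF-IDF weights and mutual information derived from word co-occurrence, but does not give the paper's formulas. The sketch below is only an illustration of how such quantities might be computed and combined; it is not the authors' method. The function names, the document-level co-occurrence window, the `alpha` parameter, and the combination rule in `boosted_weights` are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def pmi(docs):
    """Pointwise mutual information of word pairs that co-occur in a document.

    Assumption: two words "co-occur" when they appear in the same document.
    Keys are alphabetically sorted pairs (a, b).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    pair_df = Counter(
        pair for doc in docs for pair in combinations(sorted(set(doc)), 2)
    )
    return {
        (a, b): math.log((c / n_docs) / ((df[a] / n_docs) * (df[b] / n_docs)))
        for (a, b), c in pair_df.items()
    }


def boosted_weights(docs, alpha=0.5):
    """Hypothetical combination rule (not from the paper): scale each tf-idf
    weight by 1 + alpha * (mean positive PMI with the other words in the
    same document), so strongly associated words gain weight."""
    base = tf_idf(docs)
    scores = pmi(docs)
    result = []
    for doc, weights in zip(docs, base):
        terms = sorted(set(doc))
        boosted = {}
        for t in terms:
            assoc = [
                max(0.0, scores[tuple(sorted((t, u)))]) for u in terms if u != t
            ]
            mean_assoc = sum(assoc) / len(assoc) if assoc else 0.0
            boosted[t] = weights[t] * (1.0 + alpha * mean_assoc)
        result.append(boosted)
    return result


# Toy corpus of pre-tokenized documents (made up for illustration).
docs = [
    ["text", "feature", "weight"],
    ["text", "feature"],
    ["text", "classifier"],
]
base = tf_idf(docs)
boosted = boosted_weights(docs)
```

In this toy corpus "text" occurs in every document, so its TF-IDF weight is zero and no association boost can change that; this is the kind of limitation of plain TF-IDF that the abstract alludes to. The resulting weight dictionaries could be turned into vectors and passed to a classifier such as KNN, as in the experiments the abstract describes.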

Keywords: word association; word co-occurrence rate; vector space model; feature extraction; weight calculation

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
