Text feature word extraction method based on word association (cited by: 10)

Text feature word selection based on relationship between words

Source: Journal of Computer Applications (《计算机应用》), 2007, No. 12, pp. 3009-3012 (4 pages)

Abstract: Describing text features is one of the fundamental tasks in automatic text processing. Current approaches generally use a weighted Vector Space Model (VSM), which mostly relies on statistical and empirical weighting algorithms: the weight of each feature dimension is its TF-IDF value. This makes it difficult to emphasize the features that are key to a text's content, and it does not capture the relationships between words in the text well, although such relationships are important in information extraction. To address these shortcomings, a new feature selection and term weighting method based on keywords and word co-occurrence frequency is proposed. Building on TF-IDF, the method uses the structural information of the text and applies mutual information theory to extract the words that are key to the text's content. The weight calculation combines word position, word relationships (dependence), word frequency, and document frequency. This emphasizes the contribution of key words, compensates for some weaknesses of computing weights with the TF-IDF function alone, and lets the text's feature vector carry information about the relationships between words. In classification experiments (with a KNN classifier according to the Chinese abstract; the English abstract names SVM), the method shows a clear improvement in average classification accuracy over the traditional TF-IDF method.
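The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea: start from TF-IDF, measure word association with pointwise mutual information over sentence-level co-occurrence, and boost the weights of words that are strongly associated or appear in a prominent position. The function names, the `title_bonus` and `assoc_scale` parameters, the multiplicative combination rule, and the treatment of the first sentence as the title are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf(docs):
    """Plain TF-IDF. docs: list of documents, each a list of tokenized sentences."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update({w for sent in doc for w in sent})
    weights = []
    for doc in docs:
        tf = Counter(w for sent in doc for w in sent)
        total = sum(tf.values())
        weights.append(
            {w: (c / total) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return weights


def cooccurrence_pmi(doc):
    """Pointwise mutual information for word pairs that share a sentence."""
    n_sent = len(doc)
    word_count = Counter()
    pair_count = Counter()
    for sent in doc:
        vocab = sorted(set(sent))  # sorted so each pair has one canonical key
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))
    return {
        (a, b): math.log(
            (c / n_sent) / ((word_count[a] / n_sent) * (word_count[b] / n_sent))
        )
        for (a, b), c in pair_count.items()
    }


def reweight(doc, base, title_bonus=0.5, assoc_scale=0.5):
    """Scale TF-IDF weights by association strength and position (hypothetical rule)."""
    assoc = Counter()
    for (a, b), value in cooccurrence_pmi(doc).items():
        if value > 0:  # only positive association contributes
            assoc[a] += value
            assoc[b] += value
    title_words = set(doc[0])  # assumption: the first sentence is the title
    reweighted = {}
    for word, weight in base.items():
        factor = 1.0 + assoc_scale * assoc[word]
        if word in title_words:
            factor += title_bonus
        reweighted[word] = weight * factor
    return reweighted


# Toy corpus: two documents, each a list of tokenized sentences.
docs = [
    [["text", "feature", "extraction"], ["feature", "weight", "text"], ["weight", "term"]],
    [["image", "feature"], ["image", "pixel"]],
]
base = tf_idf(docs)
boosted = reweight(docs[0], base[0])
```

In this sketch a word that occurs in every document keeps a weight of zero, since the boost is multiplicative; an additive association term would be the alternative if such words should still be able to surface as features.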

Keywords: word association; word co-occurrence rate; vector space model; feature extraction; weight calculation

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
