Text categorization algorithm based on feature order pair quantization (cited by: 4)

Source: Journal of Tsinghua University (Science and Technology), 2006, No. 4, pp. 527-529, 533 (4 pages)

Funding: National "863" High-Tech Program of China (2001AA114071)

Abstract: Text categorization should exploit as much of the constraint information present in language as possible, yet commonly used text representations ignore the order of the linguistic features that make up a text. This paper uses a clustering-based method to quantize ordered pairs of linguistic features quickly, and from this derives a new text representation based on ordered feature pairs that exposes the feature order information in a text. Using the vector space centroid method, documents were represented by word pairs and by word-class pairs and evaluated on three data sets. Both representations outperform those based on single words or single word classes: macro-averaged F1 improves by 3%-4% absolute for word pairs and by 5%-7% absolute for word-class pairs (relative improvements of 4%-5% and 8%-10%, respectively). Feature order information therefore plays an important role in improving text categorization performance.
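The pipeline named in the abstract (quantize ordered feature pairs, then classify with vector space centroids) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the clusters are built or which tokens are paired, so the hand-written `WORD_CLUSTER` table, the adjacent-token pairing rule, the cosine scoring, and all function names below are assumptions made only for illustration.

```python
from collections import Counter, defaultdict
import math

# Hypothetical word -> cluster-id table standing in for the clustering step.
# The abstract does not describe how clusters are obtained; these ids are made up.
WORD_CLUSTER = {
    "stocks": 0, "shares": 0, "fell": 1, "rose": 1,
    "team": 2, "club": 2, "won": 3, "lost": 3,
}
UNKNOWN = -1  # cluster id used for words missing from the table


def ordered_pair_features(tokens):
    """Count ordered pairs of cluster ids for adjacent tokens (assumed pairing rule)."""
    ids = [WORD_CLUSTER.get(t, UNKNOWN) for t in tokens]
    return Counter(zip(ids, ids[1:]))


def normalize(vec):
    """Scale a sparse vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)


def cosine(a, b):
    """Cosine similarity of two sparse vectors that are already unit length."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def train_centroids(docs):
    """docs: iterable of (tokens, label). Returns label -> unit-length centroid."""
    sums = defaultdict(Counter)
    for tokens, label in docs:
        sums[label].update(normalize(ordered_pair_features(tokens)))
    return {label: normalize(total) for label, total in sums.items()}


def classify(tokens, centroids):
    """Return the label whose centroid is most similar to the document vector."""
    doc = normalize(ordered_pair_features(tokens))
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Toy training data, invented for this example.
train = [
    ("stocks fell".split(), "finance"),
    ("shares rose".split(), "finance"),
    ("team won".split(), "sports"),
    ("club lost".split(), "sports"),
]
centroids = train_centroids(train)
prediction = classify("shares fell".split(), centroids)
```

Because the pair `(2, 3)` is a different feature from `(3, 2)`, "team won" and "won team" receive different vectors, which is the order information a bag-of-words representation discards. Mapping words to cluster ids before pairing keeps the number of distinct pair features at most the square of the cluster count, not the square of the vocabulary size.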

Keywords: text categorization; feature selection; feature abstraction; feature transformation; singular value decomposition

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
