Mining Web Documents Based on a Suffix-Tree Word Sequence Kernel (cited by: 2)

Suffix-Tree Word Sequence Kernel for Web Document Mining

Source: Microelectronics & Computer (《微电子学与计算机》), 2005, No. 12, pp. 4-7 (4 pages)

Funding: National 863 Program project (8633010503)

Abstract: String kernels (SK) and word sequence kernels (WSK) are recent ways of measuring document similarity by matching non-consecutive subsequences of characters or words, but they are expensive to compute. This paper proposes the suffix-tree word sequence kernel (STWSK), a modified word sequence kernel that computes document similarity from a suffix-tree index. A suffix tree is first built from each document's word sequence; the kernel between documents is then computed on those suffix trees; finally, a support vector machine (SVM) classifies the documents. Theoretical analysis shows that the time needed to compute STWSK is linear in the length of the documents being compared, which is substantially less than the time needed for SK and WSK. STWSK is compared with WSK and the polynomial kernel (PK) on the Reuters-21578 document collection. The experimental results show that, while running faster, STWSK achieves performance comparable to WSK and better than PK, which makes it better suited to applications such as Web document mining.
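
The three steps in the abstract (index each document's word sequence, compute a kernel from the indexes, pass the kernel to a classifier) can be illustrated with a small Python sketch. It is not the paper's STWSK, whose exact definition is not given in this record. As stated assumptions, it swaps in a naive depth-bounded suffix trie in place of a linear-time suffix tree, and it scores only contiguous word n-grams with a hypothetical decay weight `decay ** n`. The names `build_suffix_trie`, `trie_kernel`, `similarity`, `max_depth`, and `decay` are all illustrative.

```python
import math


def build_suffix_trie(words, max_depth=3):
    """Insert every suffix of `words`, truncated to `max_depth` words, into a trie.

    Assumption: a naive trie stands in for a true suffix tree. Each node
    records how often the word n-gram spelled by its path occurs.
    """
    root = {}
    for start in range(len(words)):
        children = root
        for word in words[start:start + max_depth]:
            node = children.setdefault(word, {"count": 0, "children": {}})
            node["count"] += 1
            children = node["children"]
    return root


def trie_kernel(trie_a, trie_b, decay=0.5, depth=1):
    """Sum decay**n * count_a * count_b over word n-grams present in both tries.

    Assumption: this contiguous n-gram weighting is illustrative only.
    """
    total = 0.0
    for word, node_a in trie_a.items():
        node_b = trie_b.get(word)
        if node_b is None:
            continue  # n-gram absent from the second document
        total += (decay ** depth) * node_a["count"] * node_b["count"]
        total += trie_kernel(node_a["children"], node_b["children"], decay, depth + 1)
    return total


def similarity(doc_a, doc_b, max_depth=3, decay=0.5):
    """Normalized kernel value in [0, 1] for two whitespace-tokenized documents."""
    trie_a = build_suffix_trie(doc_a.lower().split(), max_depth)
    trie_b = build_suffix_trie(doc_b.lower().split(), max_depth)
    k_ab = trie_kernel(trie_a, trie_b, decay)
    k_aa = trie_kernel(trie_a, trie_a, decay)
    k_bb = trie_kernel(trie_b, trie_b, decay)
    if k_aa == 0.0 or k_bb == 0.0:
        return 0.0  # at least one document is empty
    return k_ab / math.sqrt(k_aa * k_bb)
```

For classification, a matrix of such pairwise values could be handed to any SVM implementation that accepts a precomputed kernel matrix, for example scikit-learn's `SVC(kernel="precomputed")`.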

Keywords: kernel learning methods; word sequence kernel; string kernel; suffix tree; Web mining

Classification code: TP31 [Automation and Computer Technology: Computer Software and Theory]
