Affiliation: [1] School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049
Source: Microelectronics & Computer (《微电子学与计算机》), 2005, No. 12, pp. 4-7 (4 pages)
Funding: National 863 Program project (8633010503)
Abstract: The string kernel (SK) and the word sequence kernel (WSK) compute document similarity by matching non-consecutive subsequences (of characters for SK, of words for WSK), but both are expensive to compute. This paper presents the suffix tree word sequence kernel (STWSK), a modified word sequence kernel that represents each document as a suffix tree and computes document similarity from that suffix tree index. First, a suffix tree is constructed from the word sequence of each document; next, the similarity between documents is computed with the word sequence kernel over the suffix trees; finally, a Support Vector Machine uses the kernel to classify the documents. Theoretical analysis shows that the computing time of STWSK is linear in the length of the compared documents, which is substantially lower than that of SK and WSK. STWSK is compared with WSK and the polynomial kernel (PK) on the Reuters-21578 text dataset. The experimental results show that, while being faster to compute, STWSK performs comparably to WSK and better than PK, which makes it better suited to applications such as Web document mining.
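Illustrative sketch: the three steps in the abstract (build a tree over the word sequence, compute a kernel from two trees, feed the result to a classifier) can be illustrated roughly as below. This is a simplified stand-in under stated assumptions, not the paper's algorithm: it uses a depth-limited suffix trie built naively rather than a linear-time suffix tree, it matches contiguous word sequences only, and the decay weighting, the depth limit, and every function name (`build_suffix_trie`, `trie_kernel`, `normalized_kernel`) are hypothetical choices, since the abstract does not give the kernel's exact definition.

```python
from math import sqrt


def build_suffix_trie(words, max_depth):
    """Insert every suffix of `words`, cut off at `max_depth` words, into a trie.

    Each node at depth d stands for one contiguous d-word sequence and stores
    how many times that sequence occurs in the document.
    """
    root = {"count": 0, "children": {}}
    for start in range(len(words)):
        node = root
        for word in words[start:start + max_depth]:
            node = node["children"].setdefault(word, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def trie_kernel(node_a, node_b, decay, depth=1):
    """Sum decay**(2 * length) * count_a * count_b over word sequences in both tries.

    This is the inner product of feature vectors whose entries are
    decay**length * occurrence_count, so it is a valid kernel.
    (Assumed weighting; the paper's exact formula is not in the abstract.)
    """
    total = 0.0
    for word, child_a in node_a["children"].items():
        child_b = node_b["children"].get(word)
        if child_b is not None:
            total += decay ** (2 * depth) * child_a["count"] * child_b["count"]
            # Only shared branches are followed, so unmatched subtrees are skipped.
            total += trie_kernel(child_a, child_b, decay, depth + 1)
    return total


def normalized_kernel(words_a, words_b, max_depth=3, decay=0.5):
    """Kernel value scaled to [0, 1] so that long documents do not dominate."""
    trie_a = build_suffix_trie(words_a, max_depth)
    trie_b = build_suffix_trie(words_b, max_depth)
    k_ab = trie_kernel(trie_a, trie_b, decay)
    k_aa = trie_kernel(trie_a, trie_a, decay)
    k_bb = trie_kernel(trie_b, trie_b, decay)
    if k_aa == 0.0 or k_bb == 0.0:
        return 0.0  # at least one document is empty
    return k_ab / sqrt(k_aa * k_bb)


if __name__ == "__main__":
    doc_a = "the cat sat on the mat".split()
    doc_b = "the cat ran to the mat".split()
    print(normalized_kernel(doc_a, doc_b))
```

A matrix of such pairwise values could then be passed to an SVM implementation that accepts precomputed kernels, which corresponds to the classification step described in the abstract.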
Keywords: kernel learning methods; word sequence kernel; string kernel; suffix tree; Web mining
Classification code: TP31 [Automation and Computer Technology: Computer Software and Theory]