Similarity Measures for XML Documents Based on Kernel Matrix Learning (基于核矩阵学习的XML文档相似度量方法)  Cited by: 10


Authors: Yang Jianwu (杨建武)[1,2], Chen Xiaoou (陈晓鸥)[1,2]

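Affiliations: [1] Institute of Computer Science, Peking University (北京大学计算机研究所); [2] National Key Laboratory of Text Information Processing Technology, Peking University (北京大学文字信息处理技术国家重点实验室), Beijing 100871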

Source: Journal of Software (《软件学报》), 2006, No. 5, pp. 991-1000 (10 pages)

Abstract: XML documents, as a new form of data, have become an active research area. Computing the similarity between XML documents is the basis for XML document analysis, management, and text mining. The Structured Link Vector Model (SLVM) is a document model that measures the similarity of XML documents using both their content and their structure. The kernel matrix, which describes the relations between the structural units of XML documents, plays an important role in SLVM. To capture these relations automatically, the paper proposes two algorithms for learning the kernel matrix: a regression learning algorithm based on the support vector machine (SVM) and a learning algorithm based on matrix iteration. For performance evaluation, the proposed similarity measure is applied to similarity search. The experimental results show that the similarity measure based on kernel matrix learning is significantly more accurate than the other measures compared. Further experiments show that, compared with the SVM-based regression learning algorithm, the kernel matrix learning algorithm based on matrix iteration not only achieves higher accuracy but also needs fewer training documents and a lower computational cost.

Keywords: XML document; similarity measure; kernel matrix learning; text mining

Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
