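Affiliations: [1] Higher Vocational College, Shanghai University of Engineering Science, Shanghai 200437 [2] Shanghai Baosight Software, Shanghai 201203
Source: Science and Technology Innovation Herald (《科技创新导报》), 2009, No. 2, pp. 43-45 (3 pages)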
Abstract: Data quality is one of the most important problems enterprises face when building business intelligence systems. When handling data quality in a very large database (VLDB), identifying and consolidating fuzzy duplicate records is particularly difficult. This article proposes an improved data quality algorithm for VLDBs. It first detects approximately duplicate records with an improved clustering-based N-gram algorithm, computes the degree of similarity between records by pair-wise comparison, and clusters approximately duplicate records using a fixed-size priority queue as the window. It also introduces a transitive closure criterion to produce a multi-pass clustering method that improves clustering accuracy. The algorithm achieves more than 90% accuracy in both language identification and keyword detection.
Classification: TP311 [Automation and Computer Technology - Computer Software and Theory]
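The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and this record does not give the details of their algorithm. Everything here is an assumption made for illustration: character trigrams with Jaccard similarity as the pair-wise measure, a plain `deque` standing in for the fixed-size priority queue window, a union-find structure for the transitive closure step, and the names `ngrams`, `similarity`, `UnionFind`, and `find_duplicates`, along with the `window` and `threshold` defaults.

```python
from collections import deque


def ngrams(text, n=3):
    """Return the set of character n-grams of a whitespace/case-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def similarity(a, b, n=3):
    """Pair-wise Jaccard similarity over n-gram sets (an assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


class UnionFind:
    """Disjoint sets: merging matched pairs yields the transitive closure."""

    def __init__(self, size):
        self.parent = list(range(size))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[max(rx, ry)] = min(rx, ry)


def find_duplicates(records, keys, window=4, threshold=0.5):
    """Multi-pass clustering: one sorted pass per key, each with a fixed-size window."""
    uf = UnionFind(len(records))
    for key in keys:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        recent = deque(maxlen=window)  # simplified stand-in for the priority queue
        for i in order:
            for j in recent:
                # Skip pairs already known to be in the same cluster.
                if uf.find(i) != uf.find(j) and \
                        similarity(records[i], records[j]) >= threshold:
                    uf.union(i, j)
            recent.append(i)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(uf.find(i), []).append(i)
    return sorted(clusters.values())


records = ["data quality", "data quallity",
           "business intelligence", "data  Quality"]
print(find_duplicates(records, keys=[lambda r: r.lower()]))
# prints [[0, 1, 3], [2]]
```

Each entry in `keys` triggers one pass over the records in a different sort order, so duplicates that sort far apart under one key can still meet inside the window under another. Because matches from all passes are merged in the same union-find structure, records linked only indirectly end up in one cluster.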