基于模拟退火的在线Web文档内容数据质量评估  被引量:1

Data quality assessment of Web article content based on simulated annealing

在线阅读下载全文

作  者:韩京宇[1] 陈可佳[1] 

机构地区:[1]南京邮电大学计算机学院,南京210003

出  处:《计算机应用》2014年第8期2311-2316,2331,共7页journal of Computer Applications

基  金:国家自然科学基金资助项目(61003040;61100135)

摘  要:针对基于训练模型或用户交互的Web数据质量评估方法不能在线响应,也不能获取内容事实内涵的问题,提出一种基于模拟退火(SA)的在线Web文档内容数据质量评估(QASA)方法。首先,通过在Web上搜集主题相关文档,构建目标文档的相关空间,进一步采用开放式信息抽取技术抽取文档内容的事实;然后,采用SA技术在线构建两个最重要的数据质量维度即准确性和完整性的参照;最后,通过比对目标文档和维度参照的事实来量化数据质量维度。实验结果表明,QASA方法可以及时返回近似最优解,并保持与离线算法等同或高于10%的精度。该方法不仅能满足实时响应的要求,而且具有高的评估精度,可应用于在线识别高质量的Web文档。Because the existing Web quality assessment approaches rely on trained models, and users' interactions not only cannot meet the requirements of online response, but also can not capture the semantics of Web content, a data Quality Assessment based on Simulated Annealing (QASA) method was proposed. Firstly, the relevant space of the target article was constructed by collecting topic-relevant articles on the Web. Then, the scheme of open information extraction was employed to extract Web articles' facts. Secondly, Simulated Annealing (SA) was employed to construct the dimension baselines of two most important quality dimensions, namely accuracy and completeness. Finally, the data quality dimensions were quantified by comparing the facts of target article with those of the dimension baselines. The experimental results show that QASA can find the near-optimal solutions within the time window while achieving comparable or even 10 percent higher accuracy with regard to the related works. The QASA method can precisely grasp data quality in real-time, which caters for the online identification of high-quality Web articles.

关 键 词:数据质量 WEB文档 模拟退火 维度 事实 

分 类 号:TP311.13[自动化与计算机技术—计算机软件与理论] TP18[自动化与计算机技术—计算机科学与技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象