Research and Implementation of Textual Similarity in Distributed Environment (分布式环境下的文档相似度研究与实现)

Cited by: 6

Author: Zhao Huaming (赵华茗) [1]

Source: New Technology of Library and Information Service (《现代图书情报技术》), 2011, No. 7, pp. 14-20 (7 pages)

Abstract: Traditional similarity computation methods run into bottlenecks when processing massive amounts of information, notably limits on the size of the data set they can handle and insufficient performance. Taking unstructured documents as its research subject, this paper proposes a document similarity computation method built on the Hadoop distributed environment that combines the Hive data processing platform with the PostgreSQL relational database. It presents the key technical ideas, the concrete implementation steps, and an empirical study. The results show that Hive SQL can effectively reduce the complexity of distributed data processing, but that its real-time performance still needs improvement.
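The abstract does not say which similarity measure or table layout the paper uses. As a rough illustration of the kind of computation that a SQL-style self-join can express, the following minimal sketch assumes cosine similarity over raw term-frequency vectors. The function names `term_frequencies` and `pairwise_cosine`, the whitespace tokenizer, and the sample documents are all hypothetical and are not taken from the paper. The inverted index stands in for a (document, term, count) table, and the pairing step mirrors a self-join on the term column followed by a grouped sum.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def term_frequencies(text):
    """Lowercase, split on whitespace, and count terms (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def pairwise_cosine(docs):
    """Return {(id_a, id_b): cosine} for every document pair sharing at least one term."""
    tf = {doc_id: term_frequencies(text) for doc_id, text in docs.items()}
    norms = {
        doc_id: math.sqrt(sum(c * c for c in counts.values()))
        for doc_id, counts in tf.items()
    }

    # Inverted index: term -> [(doc_id, count)], the analogue of a
    # (doc, term, count) table in a relational or Hive setting.
    postings = defaultdict(list)
    for doc_id, counts in tf.items():
        for term, count in counts.items():
            postings[term].append((doc_id, count))

    # Analogue of a self-join on term plus GROUP BY (doc_a, doc_b):
    # accumulate the dot product for each pair that shares a term.
    dots = defaultdict(float)
    for entries in postings.values():
        for (a, count_a), (b, count_b) in combinations(sorted(entries), 2):
            dots[(a, b)] += count_a * count_b

    return {
        (a, b): dot / (norms[a] * norms[b])
        for (a, b), dot in dots.items()
    }


if __name__ == "__main__":
    sample = {
        "d1": "hadoop hive similarity",
        "d2": "hadoop hive postgresql",
        "d3": "library information service",
    }
    for pair, score in sorted(pairwise_cosine(sample).items()):
        print(pair, round(score, 3))
```

Pairs with no term in common never appear in the join output, so they are implicitly treated as having similarity zero. That property is what keeps a join-based formulation tractable on sparse document collections.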

Keywords: Hadoop; Hive; similarity; unstructured

Classification number: G350 [Cultural Science: Information Science]
