An associated storage and retrieval system of massive Web data based on multi-attributes (基于多属性的海量Web数据关联存储及检索系统)  Cited by: 13

Authors: Luo Fang (罗芳)[1], Li Chunhua (李春花)[1], Zhou Ke (周可)[1], Huang Yongfeng (黄永峰)[2], Liao Zhengshuang (廖正霜)

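Affiliations: [1] School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074; [2] Department of Electronic Engineering, Tsinghua University, Beijing 100084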

Source: Computer Engineering & Science (《计算机工程与科学》), 2014, No. 3, pp. 404-410 (7 pages)

Funding: National 863 Program of China (2012AA011004); Tsinghua University Initiative Scientific Research Program (20111081023)

Abstract: Traditional Web data retrieval generally relies on full-text search, which is flexible. Public opinion analysis, however, often needs web page attributes and statistics about them, and full-text search alone cannot supply these. To meet this need, an associated storage and retrieval system for massive Web data based on multiple attributes is designed and implemented on the Hadoop platform, providing basic retrieval and statistics services for public opinion analysis. First, web page data is classified and clustered by attribute, using machine learning, and stored on HDFS in associated form, which addresses the small-file storage problem and improves data access throughput. Second, an association mapping is established between the original web page data and the extracted attribute data. Finally, building on the existing indexing mechanism of HBase and combining it with a distributed local indexing mechanism, the system addresses the secondary index problem for multi-condition selection queries over dynamic attributes on HBase.

Keywords: classified storage; multi-condition selection query; association mapping; secondary index

Classification code: TP391.3 [Automation and Computer Technology: Computer Application Technology]

 
