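Authors: Zhao Hui [1], Yang Shuqiang [1], Chen Zhikun [1], Yin Hong [1], Jin Songchang [1]
Affiliation: [1] College of Computer, National University of Defense Technology, Changsha 410073
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2014, No. 3, pp. 606-617 (12 pages)
Funding: National High Technology Research and Development Program of China (863 Program) (2012AA012600; 2012AA01A402; 2012AA01A401; 2011AA010702; 2010AA012505); National Natural Science Foundation of China (60933005; 91124002); National Key Technology R&D Program of China (2012BAH38B04; 2012BAH38B06); National 242 Information Security Program (2011A010)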
Abstract: In recent years the MapReduce parallel computing model has attracted wide attention from industry and academia, and systems built on it have been deployed successfully inside large companies such as Google, Yahoo!, and Facebook. MapReduce-based systems, however, were originally designed for batch processing of massive unstructured and semi-structured data, for example building inverted indexes, computing web page PageRank, and analyzing logs. Their design lacks optimizations for interactive analysis of massive structured data: they always scan the full dataset by brute force, which runs counter to the selective query and analysis pattern common in structured data management. To address this problem, we bring the global indexing technique widely used in traditional database management to Hadoop, the open-source project based on the MapReduce model. We build a global index at block granularity over structured data stored in the Hadoop Distributed File System, and present a job compilation and scheduling optimization algorithm for range query analysis. The main goal is to use application semantics and the auxiliary index structure to reduce the number of unnecessary map tasks, and thereby reduce job scheduling and execution overhead. In the experiments we report the optimization effect for four data selectivities (80%, 50%, 30%, and 10%) under three cluster sizes: job response time improves by up to 5x, I/O cost by up to 10x, and task scheduling cost by up to 11x.
Classification number: TP316.4 [Automation and Computer Technology: Computer Software and Theory]
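The abstract's core idea, using a block-granularity index to skip map tasks whose blocks cannot contain any row in the queried range, can be illustrated with a small sketch. This is not the paper's algorithm or Hadoop's API: the abstract does not specify the index structure, so the sketch assumes a simple per-block min/max key summary, and every name in it (`BlockIndexEntry`, `build_block_index`, `select_blocks`, `run_range_query`) is hypothetical. Blocks are modeled as in-memory lists of integer keys, and a "map task" is just a loop over one block.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BlockIndexEntry:
    """Hypothetical index entry: key range covered by one block."""
    block_id: int
    min_key: int
    max_key: int


def build_block_index(blocks: List[List[int]]) -> List[BlockIndexEntry]:
    """Summarize each non-empty block by its minimum and maximum key."""
    return [
        BlockIndexEntry(block_id, min(keys), max(keys))
        for block_id, keys in enumerate(blocks)
        if keys
    ]


def select_blocks(index: List[BlockIndexEntry], low: int, high: int) -> List[int]:
    """Return ids of blocks whose key range overlaps [low, high]."""
    return [
        entry.block_id
        for entry in index
        if entry.max_key >= low and entry.min_key <= high
    ]


def run_range_query(
    blocks: List[List[int]],
    index: List[BlockIndexEntry],
    low: int,
    high: int,
) -> Tuple[List[int], List[int]]:
    """Scan only the selected blocks, one simulated map task per block."""
    selected = select_blocks(index, low, high)
    matches: List[int] = []
    for block_id in selected:
        # Simulated map task: filter the rows of a single block.
        matches.extend(k for k in blocks[block_id] if low <= k <= high)
    return selected, sorted(matches)


if __name__ == "__main__":
    data = [[1, 5, 9], [10, 14, 19], [20, 25, 29], [30, 35, 39]]
    idx = build_block_index(data)
    scanned, rows = run_range_query(data, idx, 12, 22)
    print("blocks scanned:", scanned)
    print("matching keys:", rows)
```

In this toy setting a query over keys 12 to 22 touches two of the four blocks, so half of the map tasks are never scheduled. The benefit depends on how well keys cluster by block; if every block spans the full key range, a min/max summary prunes nothing.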