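Authors: LI Ruixuan (李瑞轩)[1], LIAO Dongjie (廖东杰), GU Xiwu (辜希武)[1], WEN Kunmei (文坤梅)[1], ZHAO Shuoyi (赵铄乂), DONG Xinhua (董新华)[1]
Affiliation: [1] School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), 2014, No. 12, pp. 1409-1421 (13 pages)
Funding: National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program); Huazhong University of Science and Technology Independent Innovation Fund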
Abstract: Subject indexing of documents is an important prerequisite for personalized intelligent search, but on large-scale data it becomes a performance bottleneck. Subject indexing algorithms currently implemented on the MapReduce framework typically suffer from long task start-up times and excessive disk I/O for intermediate data. To address these problems, this paper adopts YARN (yet another resource negotiator) as the underlying distributed resource management platform and selects more suitable computing frameworks to improve performance. Because subject indexing algorithms involve many computation steps with clearly separated stages, the directed acyclic graph (DAG) computing model is chosen for the implementation; it avoids unnecessary job splitting and thereby reduces the disk I/O spent on intermediate results. In addition, MapReduce's sorting strategy is time-consuming, and some computations do not need sorted results, so a hash-based data reduction strategy can be used instead to improve performance. This in turn introduces a random-read problem. The paper designs an optimization strategy that exploits the fast random reads of solid state disks (SSD) to solve it. Experimental comparison shows that, with YARN as the underlying management platform, choosing an appropriate computing framework and optimizing it can effectively improve the performance of distributed computation.
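Keywords: subject indexing; YARN platform; directed acyclic graph (DAG) computing framework; solid state disk (SSD)
Classification: TP319 [Automation and Computer Technology: Computer Software and Theory]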
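
The abstract contrasts sort-based and hash-based data reduction. As a rough, single-machine illustration of that contrast only (not the authors' implementation, and with no YARN, DAG engine, or SSD involved), the sketch below aggregates key/value pairs both ways. The function names `reduce_by_sort` and `reduce_by_hash` and the sample data are hypothetical, chosen for this example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def reduce_by_sort(pairs):
    """Sort-based reduction: sort every pair by key, then sum each run of equal keys.

    This mirrors the general idea of a sort-then-group shuffle; the sort
    costs O(n log n) even when the caller does not need ordered output.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def reduce_by_hash(pairs):
    """Hash-based reduction: accumulate into a hash table keyed by term.

    No sort is needed, but keys are touched in arrival order, so accesses
    to the table are effectively random rather than sequential.
    """
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical (term, count) pairs, standing in for intermediate results.
sample = [("yarn", 1), ("dag", 2), ("yarn", 3), ("ssd", 1), ("dag", 1)]
sorted_result = reduce_by_sort(sample)
hashed_result = reduce_by_hash(sample)
```

Both functions return the same totals; they differ only in how the pairs are grouped. The hash variant skips the sort at the cost of non-sequential access, which is the trade-off the abstract says the SSD-oriented optimization is meant to address.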