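Affiliation: [1] School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003
Source: Computer Technology and Development (《计算机技术与发展》), 2017, No. 11, pp. 171-175 (5 pages)
Funding: 2015 Jiangsu Province Industry-University-Research Prospective Joint Research Project (BY2015011-02)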
Abstract: With the arrival of the big data era, big data is being applied ever more widely across industries. It is also common among telecom operators, where it has run into many technical problems; building household profiles ("family portraits") is a core problem for operator big data. How to extract and identify the terminal types behind fixed-line broadband remains an open problem. Unlike mobile networks, fixed-line broadband has no signaling channel and therefore carries no accurate terminal information, which makes terminal type identification difficult. Traditional methods identify terminals by parsing and matching the UA (User-Agent) field in HTTP GET messages. Because UA strings are not standardized and terminals are numerous and varied, this approach yields low identification accuracy. This paper uses the Hadoop framework and a Hive UDF, combined with a distributed crawler that builds the terminal library, to identify users' online terminal information more quickly and accurately. Experimental results show that terminal identification accuracy can exceed 92%, a substantial improvement over the traditional method.
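Keywords: terminal identification; Hadoop; User Defined Function (UDF); distributed crawler; fixed-line broadband; big data operations
Classification: TP31 [Automation and Computer Technology: Computer Software and Theory]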
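
The UA-matching step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the record does not give the terminal library, the matching rules, or the UDF code, so the patterns, the `TERMINAL_LIBRARY` table, and the `identify_terminal` function below are all hypothetical. In the paper's setting, comparable logic would presumably run inside a Hive UDF over HTTP GET logs, with the library populated by the distributed crawler.

```python
import re
from typing import Optional

# Hypothetical terminal library: (compiled pattern, terminal type).
# The paper builds its library with a distributed crawler; these few
# hand-written entries are assumptions for illustration only.
# Order matters: more specific patterns come before general ones.
TERMINAL_LIBRARY = [
    (re.compile(r"iPhone", re.IGNORECASE), "phone"),
    (re.compile(r"iPad", re.IGNORECASE), "tablet"),
    (re.compile(r"Android.*Mobile", re.IGNORECASE), "phone"),
    (re.compile(r"Android", re.IGNORECASE), "tablet"),
    (re.compile(r"Windows NT|Macintosh|X11", re.IGNORECASE), "pc"),
]


def identify_terminal(user_agent: Optional[str]) -> str:
    """Return the terminal type of the first library pattern that matches
    the User-Agent string, or "unknown" if the string is missing or no
    pattern matches."""
    if not user_agent:
        return "unknown"
    for pattern, terminal_type in TERMINAL_LIBRARY:
        if pattern.search(user_agent):
            return terminal_type
    return "unknown"


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X)",
        "Mozilla/5.0 (Linux; Android 7.0; SM-T810) AppleWebKit/537.36 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "curl/7.54.0",
    ]
    for ua in samples:
        print(identify_terminal(ua), "<-", ua)
```

The fallback to "unknown" reflects the weakness the abstract attributes to the traditional method: a non-standard UA string that matches no library entry cannot be classified, so accuracy depends on how complete the library is.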