Author: NING Tao (宁滔) (School of Computer Engineering, Guilin University of Electronic Technology, Beihai 536000, China)
Affiliation: [1] School of Computer Engineering, Guilin University of Electronic Technology, Beihai 536000, Guangxi, China
Source: Modern Electronics Technique (《现代电子技术》), 2024, No. 9, pp. 164-168 (5 pages)
Funding: Guangxi Vocational Education Teaching Reform Key Project, 2021-2024 (GXGZJG2021A035).
Abstract: In big data, the distribution of samples across categories may be imbalanced: some categories have far fewer samples than others. In such cases, traditional sampling methods fail to reflect the characteristics and differences of all categories accurately. To improve the applicability of big data information, this paper studies a differential mining algorithm for directional sampling of massive big data. Starting from an initialized set of website uniform resource locators (URLs), web pages are crawled from the network and their Hypertext Markup Language (HTML) data is collected; links relevant to the target data are extracted and added to the URL queue. Data search and processing are then carried out according to the network search strategy. Once the search on a page is complete, the crawler automatically moves on to the URL of the next page and continues the directional sampling. Fuzzy feature matching combined with detection filtering provides anti-interference processing during directional sampling. A rough set algorithm is used for mining, and an extended difference matrix is used to reduce the values in the big data decision table, achieving pattern classification of massive big data. Experimental results show that the packet loss rate of the algorithm during data collection is kept basically below 0.2%, and that the algorithm is highly robust.
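Keywords: massive big data; web page crawling; directional sampling; filtering; redundancy removal; rough set; extended difference matrix; decision rules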
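Classification: TN919-34 [Electronics and Telecommunications: Communication and Information Systems]; TP311 [Electronics and Telecommunications: Information and Communication Engineering]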
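
Illustrative sketch: the abstract describes the sampling stage as a loop over a URL queue (initialize URLs, fetch HTML, extract relevant links, enqueue them, move on to the next URL). The Python sketch below shows one way such a loop could look. It is not the paper's implementation: the breadth-first order, the names `directed_crawl`, `LinkExtractor`, `fetch` and `is_relevant`, the `max_pages` limit, and the in-memory `FAKE_WEB` pages are all assumptions made for illustration, and the fuzzy feature matching, filtering, and rough set steps are not modeled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def directed_crawl(seed_urls, fetch, is_relevant, max_pages=100):
    """Breadth-first crawl that only enqueues links accepted by is_relevant.

    fetch(url) returns the page HTML as a string, or None on failure.
    Returns a dict mapping each visited URL to its HTML.
    """
    queue = deque(seed_urls)   # URL queue, initialized from the seeds
    seen = set(seed_urls)      # avoids enqueuing the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return pages


# Tiny in-memory "web" (hypothetical pages) so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": '<a href="/data/a">A</a> <a href="/about">About</a>',
    "http://example.com/data/a": '<a href="/data/b">B</a> <a href="/">Home</a>',
    "http://example.com/data/b": "<p>no links</p>",
}

pages = directed_crawl(
    ["http://example.com/"],
    fetch=FAKE_WEB.get,
    is_relevant=lambda u: "/data/" in u,  # assumed relevance rule
)
print(sorted(pages))
```

Passing `fetch` and `is_relevant` as parameters keeps the queue logic separate from network access and from whatever relevance criterion the directional sampling actually uses, so either can be swapped without touching the loop.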