Authors: Xie Shuyong (谢树泳); Liu Zhiliang (刘之亮) (Huizhou Power Supply Bureau, Guangdong Power Grid Co. Ltd., Huizhou 516000, China; China Southern Power Grid, Guangzhou 510000, China)
Affiliations: [1] Huizhou Power Supply Bureau, Guangdong Power Grid Co. Ltd., Huizhou, Guangdong 516000; [2] China Southern Power Grid, Guangzhou 510000
Source: Journal of Henan Normal University (Natural Science Edition), 2025, Issue 2, pp. 124-130 (7 pages)
Funding: National Natural Science Foundation of China (52377103, 52277148); China Southern Power Grid Science and Technology Project (0313002023030103AJ0003, 031300KK52222091).
Abstract: To reduce personal safety accidents in the power grid, it is necessary to use data mining techniques to construct and analyze multi-dimensional accident data and to build accurate early warning models. One challenging problem is how to automate the collection of personal accident sample data from a massive number of web pages. This paper proposes a focused crawler algorithm that combines a Naive Bayes model with the PageRank algorithm. First, the data are preprocessed using Chinese text segmentation and keyword frequency settings; after feature selection, a Naive Bayes classification model is constructed and trained, which significantly improves the classification accuracy for power grid accidents. Then, the PageRank algorithm is used to rank the accurately classified web pages by topic relevance, which effectively avoids the topic drift problem that ordinary crawler methods often suffer from. Experimental results show that, whether under the same time budget or the same number of crawled pages, the page harvest rate of the proposed method is higher than that of using the Naive Bayes classifier or the PageRank algorithm alone. The method can therefore crawl power grid accident information more efficiently and accurately from a large number of web pages.
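The two-stage idea in the abstract (classify pages with Naive Bayes, then rank the relevant ones with PageRank) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `"accident"` label, the Laplace smoothing, the damping factor of 0.85, and the choice to run PageRank only on the subgraph of relevant pages are all assumptions made for illustration. Tokens are assumed to be already segmented, and harvest rate is taken in its usual sense of relevant pages divided by crawled pages.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Train a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # token counts per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(model, tokens):
    """Return the most probable class, using Laplace (add-one) smoothing."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


def rank_relevant_pages(model, page_tokens, links, topic="accident"):
    """Keep pages classified as on-topic, then order them by PageRank."""
    relevant = {p for p, toks in page_tokens.items()
                if classify(model, toks) == topic}
    # Assumption: only links between relevant pages are kept.
    sub_links = {p: [t for t in links.get(p, []) if t in relevant]
                 for p in relevant}
    scores = pagerank(sub_links)
    return sorted(relevant, key=lambda p: scores[p], reverse=True)


def harvest_rate(relevant_count, crawled_count):
    """Fraction of crawled pages that were on-topic."""
    return relevant_count / crawled_count if crawled_count else 0.0
```

In a crawler loop, the ordering returned by `rank_relevant_pages` would decide which outgoing links to fetch next, and `harvest_rate` would be tracked per time budget or per number of crawled pages, matching the two comparison conditions described in the abstract.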
Keywords: power grid safety; personal injury accidents; Naive Bayes model; PageRank algorithm; focused crawler
Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]