Author: DANG Hao-yu (Xi'an Mingde Institute of Technology, Xi'an 710124, Shaanxi)
Source: Computer & Telecommunication (《电脑与电信》), 2023, No. 8, pp. 90-93 (4 pages)
Abstract: Modern websites have complex page layouts and carry large amounts of displayed content and text, which makes it difficult to increase the volume of text big data extracted per unit time. To address this, the paper designs a method for extracting text big data from Web page content, supported by Python crawler technology. First, the Web page content is parsed comprehensively to obtain its text data, and the complexity of that text is calculated. Second, Python crawler technology is introduced to calculate feature information weights and identify text big data features. Finally, the extracted features are used to construct a phase space of the Web page content text, big data vector information is collected, and the key information is divided by dimension and extracted according to preset conditions. In comparative experiments under identical conditions, the proposed method extracted more text big data than the traditional methods, indicating a stronger extraction capability.
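Keywords: Python crawler technology; correlation dimension information; extraction method; big data; text; Web page content
Classification number: TP391.3 [Automation and Computer Technology - Computer Application Technology]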
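
The first two steps described in the abstract (parsing page content to obtain its text, then weighting feature information) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract gives no formulas, so the normalized term-frequency weight used here is an assumption, and the names `TextExtractor`, `extract_text`, and `term_weights` are hypothetical. Only the Python standard library is used.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def term_weights(text):
    """Assumed weighting: each term's count divided by the total term count."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}
```

Calling `term_weights(extract_text(page_html))` on a downloaded page would give a per-term weight table. The phase-space construction and dimension division mentioned in the abstract are not sketched here, because the record gives no detail on how they are defined.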