Automatic Web Information Extraction Based on the Start Delimiter (cited by: 1)

Author: BAI Yujie (School of Computer and Information Technology, Northeast Petroleum University, Daqing 163000)

Source: Microcomputer Applications (《微型电脑应用》), 2019, No. 11, pp. 141-142, 146 (3 pages)

Abstract: To quickly obtain useful information implicit in Web pages, a Web information extraction method based on the start delimiter is proposed. Sample Web pages are first collected with a Web crawler and then preprocessed. A recurrent neural network is trained on the preprocessed sample pages to obtain the start delimiter. Finally, the lxml parsing library is used to locate and extract the Web information in the target pages. In this way, semi-structured Web pages are automatically organized into structured knowledge that people can query and reuse. Extraction experiments on three MOOC websites show that the method extracts useful information well and is portable.

Keywords: Web information extraction; recurrent neural network; start delimiter; lxml

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
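
The final step described in the abstract, locating and extracting data once a start delimiter is known, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract says the delimiter is learned by a recurrent neural network and that extraction uses lxml, whereas the sketch below assumes the delimiter is already given and substitutes plain string scanning from the Python standard library so that it has no dependencies. The function name `extract_after_delimiter`, the sample page, and the delimiter strings are all hypothetical.

```python
from html import unescape
from typing import List


def extract_after_delimiter(page: str, start_delimiter: str) -> List[str]:
    """Return the text following each occurrence of start_delimiter,
    up to the next '<' (assumed here to mark the end of the data field)."""
    results = []
    pos = 0
    while True:
        hit = page.find(start_delimiter, pos)
        if hit == -1:
            break  # no further occurrences of the delimiter
        begin = hit + len(start_delimiter)
        end = page.find("<", begin)
        if end == -1:
            end = len(page)  # delimiter was followed by text only
        text = unescape(page[begin:end]).strip()
        if text:
            results.append(text)
        pos = begin  # continue scanning after this occurrence
    return results


# Hypothetical semi-structured course listing (not taken from the paper).
SAMPLE_PAGE = """
<html><body>
<div class="course"><h3 class="course-name">Python Basics</h3><span class="teacher">Li Lei</span></div>
<div class="course"><h3 class="course-name">Data Structures &amp; Algorithms</h3><span class="teacher">Han Mei</span></div>
</body></html>
"""

# Hypothetical start delimiters; in the paper these would come from the trained network.
course_names = extract_after_delimiter(SAMPLE_PAGE, '<h3 class="course-name">')
teachers = extract_after_delimiter(SAMPLE_PAGE, '<span class="teacher">')

# Pair the fields up into structured records.
records = [{"course": c, "teacher": t} for c, t in zip(course_names, teachers)]
```

Exact string matching of this kind breaks as soon as attribute order or whitespace varies between pages, which is one reason a parser-based locator such as lxml is the more robust choice for real sites.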
