Authors: 陈前华 (CHEN Qianhua); 胡嘉杰 (HU Jiajie); 江吉 (JIANG Ji); 吴豪 (WU Hao)
Affiliations: [1] Cloud Computing Center, Chinese Academy of Sciences, Dongguan, Guangdong 523808, China; [2] Artificial Intelligence Research Laboratory, Guangdong Electronics Industry Institute, Dongguan, Guangdong 523808, China
Source: Journal of Computer Applications (《计算机应用》), 2021, Issue S01, pp. 20-24 (5 pages)
Funding: National Key Research and Development Program of China (2018YFB1004600).
Abstract: On complex webpages, the main content is often buried under noise unrelated to the topic, such as advertisements, navigation bars, and copyright notices. To address this problem, Long Short-term memory based Text Extraction (LTE), a deep learning method for main-text extraction, is proposed. First, a segmentation strategy based on HyperText Markup Language (HTML) tag information is designed: the Document Object Model (DOM) tree of the HTML code is traversed, and each block that contains text is split off according to the structure of the DOM tree. Second, a pre-trained model is used to encode the hierarchical relationships of each content block. Finally, the encoded tags are fed into a Long Short-Term Memory (LSTM) network, trained in advance on data in the same format, which classifies each block as main text or non-text. Experimental results show that the model fits the labelled dataset well, with an F1 score that stays above 0.96 on the training set; for webpage layouts absent from the training set, its prediction accuracy is 47.54 and 19.02 percentage points higher than that of the two traditional extraction tools Readability and Newspaper3k, respectively. These results indicate that LTE can effectively extract the main text of webpages.
Keywords: Document Object Model (DOM); Long Short-Term Memory (LSTM) network; pre-training; deep learning; main-text extraction
Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
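
Illustrative sketch (not from the paper): the first step in the abstract, walking the DOM and splitting off every block that carries text, might look roughly like the following. This is a minimal sketch using only Python's standard-library html.parser; the names TextBlockCollector and segment_html, the tag-path representation, and the choice to skip script/style content are all assumptions made for illustration, and the paper's actual segmentation and encoding may differ.

```python
from html.parser import HTMLParser

# Assumption of this sketch: text inside these tags is never page content.
SKIP_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they are never pushed on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextBlockCollector(HTMLParser):
    """Collects (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # collected (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not SKIP_TAGS.intersection(self.stack):
            self.blocks.append(("/".join(self.stack), text))


def segment_html(html):
    """Return the text blocks of an HTML string with their DOM tag paths."""
    collector = TextBlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


if __name__ == "__main__":
    sample = (
        "<html><body><div><a href='#'>Home</a></div>"
        "<div><img src='x.png'><p>Main   text here.</p></div>"
        "<script>var x = 1;</script></body></html>"
    )
    for path, text in segment_html(sample):
        print(path, "->", text)
```

Each (tag_path, text) pair is the kind of unit that a downstream encoder and sequence classifier could label as main text or non-text; that later stage is outside the scope of this sketch.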