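Affiliation: [1] Department of Computer Science and Engineering, Dalian University of Technology, Dalian, Liaoning 116023, China
Source: Computer Engineering and Applications (《计算机工程与应用》), 2009, No. 1, pp. 140-143 (4 pages)
Funding: National Natural Science Foundation of China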
Abstract: This paper proposes the BF-DOM tree (Block node Frequency-Document Object Model), an extended DOM tree model that carries node frequency, and uses it to extract the main content of web pages. The model is built by adding two attributes, block node frequency and relevance, to certain nodes of the DOM tree; the main content is then extracted by combining the model with semantic distance. The method rests on three observations: within a set of pages from the same source, noise nodes have high frequency values; main content generally consists of non-link text; and links related to the main content are semantically close to the article title. Experiments on eight websites show that the method extracts main content effectively, with recall and precision both above 96%, outperforming an extraction method based on information entropy. The method can also reduce the storage requirements of search engines, resulting in smaller indexes, faster searches, and better user satisfaction.
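The three observations in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: pages are assumed to be already segmented into blocks, the block frequency is taken as the share of same-source pages containing an identical block, and a word-overlap (Jaccard) distance stands in for the semantic distance, which the abstract does not specify. All function names, field names, and thresholds below are hypothetical.

```python
from collections import Counter


def block_frequency(pages):
    """Share of pages in which each block text occurs (assumed frequency measure)."""
    counts = Counter()
    for blocks in pages:
        # Count each distinct block text once per page.
        for text in {b["text"] for b in blocks}:
            counts[text] += 1
    return {text: n / len(pages) for text, n in counts.items()}


def link_ratio(block):
    """Share of the block's characters that sit inside links."""
    total = len(block["text"])
    return block["link_chars"] / total if total else 1.0


def title_distance(text, title):
    """Jaccard word distance, a stand-in for the paper's semantic distance."""
    a, b = set(text.lower().split()), set(title.lower().split())
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def extract_content(page, title, freq,
                    max_freq=0.5, max_link_ratio=0.5, max_distance=0.6):
    """Keep low-frequency blocks that are mostly text or are links close to the title."""
    kept = []
    for block in page:
        if freq.get(block["text"], 0.0) > max_freq:
            continue  # observation 1: frequent blocks are treated as noise
        if link_ratio(block) <= max_link_ratio:
            kept.append(block["text"])  # observation 2: mostly non-link text
        elif title_distance(block["text"], title) <= max_distance:
            kept.append(block["text"])  # observation 3: link block close to the title
    return kept


# Toy same-source page set (invented data, for illustration only).
pages = [
    [{"text": "Home News Sports", "link_chars": 16},
     {"text": "Storm hits coast as thousands evacuate", "link_chars": 0},
     {"text": "Storm coast photos", "link_chars": 18}],
    [{"text": "Home News Sports", "link_chars": 16},
     {"text": "Local team wins final", "link_chars": 0}],
    [{"text": "Home News Sports", "link_chars": 16},
     {"text": "Markets rally on rate cut", "link_chars": 0}],
]

freq = block_frequency(pages)
result = extract_content(pages[0], "Storm hits coast", freq)
print(result)
```

In this toy data the navigation bar appears on every page and is dropped as noise, the article paragraph is kept because it contains no link text, and the link block is kept only because it shares words with the title. A real system would compute these properties on DOM nodes and use a proper semantic measure.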
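Keywords: information extraction; DOM tree with block node frequency (BF-DOM tree); node frequency; semantic distance
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]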