Author: Li Jian (李剑)
Affiliation: [1] Combat Laboratory, Nanchang Military Academy, Nanchang, Jiangxi 330103
Source: Electronic Science and Technology (《电子科技》), 2012, No. 1, pp. 105-107 (3 pages)
Abstract: To filter noisy information out of web pages efficiently, this paper proposes a web page purification method based on an improved DOM tree and a BP neural network. Using the DOM tree and the features of the page content, HTMLParser builds a content block tree that splits the page content into sub-blocks according to their relatedness, so that processing the whole content block reduces to processing the individual sub-blocks. Statistics show that the content of the sub-blocks has distinct numerical features, which can serve as the learning source for the BP neural network. Web page purification is thereby converted into the problem of building a filtering model through learning. Experimental results show that the method achieves satisfactory results on Chinese web pages that have a topic.
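The abstract does not list the block-splitting rules or the numerical features it uses, so the following is only a minimal sketch of the general idea, under stated assumptions. It uses Python's standard-library `html.parser.HTMLParser` class, which is not necessarily the HTMLParser tool named in the abstract. The set of block-delimiting tags, the flat (non-tree) segmentation, and the three features (`text_len`, `link_count`, `link_ratio`) are all hypothetical choices made for illustration. Training the BP neural network on these features is not shown.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit content blocks. The paper's own rules are not given.
BLOCK_TAGS = {"div", "td", "p", "li", "table", "ul"}


class BlockFeatureParser(HTMLParser):
    """Splits a page into flat text blocks and records simple numeric features."""

    def __init__(self):
        super().__init__()
        self.blocks = []          # one feature dict per non-empty block
        self._text_len = 0        # characters of text in the current block
        self._link_text_len = 0   # characters of that text inside <a> tags
        self._link_count = 0      # number of <a> tags opened in the current block
        self._in_link = 0         # nesting depth of open <a> tags

    def flush(self):
        # Close the current block; blocks without text are discarded.
        if self._text_len > 0:
            self.blocks.append({
                "text_len": self._text_len,
                "link_count": self._link_count,
                "link_ratio": self._link_text_len / self._text_len,
            })
        self._text_len = self._link_text_len = self._link_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1
            self._link_count += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link > 0:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self._text_len += n
        if self._in_link:
            self._link_text_len += n


def block_features(html):
    """Return a list of feature dicts, one per text-bearing block in `html`."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return parser.blocks
```

In a sketch like this, a navigation bar would tend to produce a block with a `link_ratio` near 1, while body text would produce a longer block with a ratio near 0. Feature vectors of this kind, paired with noise/content labels, are the sort of input a learned filtering model could be trained on.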
Classification number: TP393.07 [Automation and Computer Technology: Computer Application Technology]