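Authors: Zou Tao (邹涛)[1], Qi Guangzhi (戚广智)[1], Cai Lijuan (蔡丽娟)[1], Zhang Fuyan (张福炎)[1]
Affiliation: [1] State Key Laboratory for Novel Software Technology, Institute of Multimedia Computing, Nanjing University, Nanjing, Jiangsu 210093
Source: Journal of Nanjing University (Natural Science) (《南京大学学报(自然科学版)》), 2000, No. 2, pp. 183-188 (6 pages)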
Funding: Supported by the Jiangsu Provincial Science and Technology Commission "95" Key Science and Technology Project (No. BE96017)
Abstract: Network information mining is a new topic in the field of network information processing. This paper introduces the design and implementation of IDGS, a WWW-based information mining system, and discusses the application of statistics-based text feature extraction and the BP neural network model to network information mining, together with the methods and strategies required for information mining on the WWW.

Information mining on the Internet is a new technology for network information processing and an important application of data mining in the Internet area. This paper describes the design and implementation of an information mining system called IDGS, which gathers HTML documents on the World Wide Web and mines out the documents users want by using a BP neural network model trained with the backpropagation algorithm. Data mining (DM), or knowledge discovery in databases (KDD), is defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Data mining is a new technology that arose in response to the problem of "Rich Data, Poor Information". Network information mining is an application of data mining to the Internet: it extracts latent patterns from target learning samples and then uses those patterns to extract useful information from Internet resources.

The IDGS system consists of four modules: the Pattern Extraction and Feature Selection Module, the Raw Document Collection Module, the Pattern Matching Module, and the Document Database Module. It adopts a BP neural network model with the BP algorithm to match information content. The neural network that IDGS adopts has 20 input neurons, one output neuron, and 2 hidden layers. Each input neuron corresponds to one feature extracted from the learning samples, and the output neuron corresponds to the relevance to the mining target.

The feature selection strategy is based on statistics. Words or phrases are selected as features if they appear more frequently in relevant documents than in irrelevant documents. To segment Chinese sentences and compute word frequencies, we set up three dictionaries: the Main dictionary, the Thesaurus dictionary, and the Implini dictionary. All the words in those three dictionaries are included when computing word frequency, which addresses the problem of word diversity. In addition, we set several weight coefficients, such as CofTitle, CofLinkText, CofH1, and CofH2, to make use of the text marked up by HTML tags. Collecting raw d
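
The statistical feature selection described in the abstract can be illustrated with a minimal sketch. It is not the IDGS implementation. It assumes documents are already segmented into terms and grouped by HTML region, so the three dictionaries are not modeled. The weight values in `TAG_WEIGHTS` are hypothetical stand-ins for coefficients such as CofTitle, CofLinkText, CofH1, and CofH2, whose values the abstract does not give. The function names and the ranking by frequency margin are also assumptions made for illustration.

```python
from collections import Counter

# Hypothetical weights: the abstract names CofTitle, CofLinkText, CofH1 and
# CofH2 but does not state their values, so these numbers are placeholders.
TAG_WEIGHTS = {"title": 3.0, "link_text": 2.0, "h1": 2.0, "h2": 1.5, "body": 1.0}


def weighted_term_counts(doc):
    """Count terms in one document, scaling each hit by its region's weight.

    `doc` maps a region name (e.g. "title", "body") to a list of terms that
    are assumed to be segmented already.
    """
    counts = Counter()
    for region, terms in doc.items():
        weight = TAG_WEIGHTS.get(region, 1.0)  # unknown regions count as body text
        for term in terms:
            counts[term] += weight
    return counts


def mean_frequency(docs):
    """Average weighted frequency of every term over a document collection."""
    total = Counter()
    for doc in docs:
        total.update(weighted_term_counts(doc))
    n = max(len(docs), 1)  # avoid division by zero for an empty collection
    return {term: value / n for term, value in total.items()}


def select_features(relevant_docs, irrelevant_docs, k=20):
    """Keep terms that are more frequent in relevant than in irrelevant documents.

    The default k=20 mirrors the 20 input neurons mentioned in the abstract;
    ranking candidates by the size of the frequency margin is an assumption.
    """
    rel = mean_frequency(relevant_docs)
    irr = mean_frequency(irrelevant_docs)
    margins = {
        term: freq - irr.get(term, 0.0)
        for term, freq in rel.items()
        if freq > irr.get(term, 0.0)
    }
    ranked = sorted(margins, key=lambda term: (-margins[term], term))
    return ranked[:k]


if __name__ == "__main__":
    relevant = [
        {"title": ["mining"], "body": ["mining", "network", "the"]},
        {"body": ["network", "the"]},
    ]
    irrelevant = [{"body": ["the", "sports"]}, {"body": ["the", "the"]}]
    print(select_features(relevant, irrelevant))
```

In this toy run, a common word such as "the" is dropped because it is more frequent in the irrelevant set, while a term that appears in a title gains extra weight. The selected terms would then supply the input values for a relevance network like the one the abstract describes.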