Implementation of the Network Information Mining System IDGS  Cited by: 5

THE DESIGN AND IMPLEMENTATION OF AN INFORMATION MINING SYSTEM


Authors: Zou Tao (邹涛)[1], Qi Guangzhi (戚广智)[1], Cai Lijuan (蔡丽娟)[1], Zhang Fuyan (张福炎)[1]

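Affiliation: [1] State Key Laboratory for Novel Software Technology, Institute of Multimedia Computing, Nanjing University, Nanjing, Jiangsu 210093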

Source: Journal of Nanjing University (Natural Science) (《南京大学学报(自然科学版)》), 2000, No. 2, pp. 183-188 (6 pages)

Funding: Jiangsu Provincial Science and Technology Commission "95" Key Science and Technology Project (No. BE96017)

Abstract: Network information mining is a new topic in the field of network information processing. This paper introduces the design and implementation of IDGS, a WWW-based information mining system, and discusses the application of statistics-based text feature extraction and the BP neural network model to network information mining, as well as the methods and strategies required for information mining on the WWW.

Information mining on the Internet is a new technology for network information processing and an important application of data mining in the Internet domain. This paper describes the design and implementation of an information mining system called IDGS, which gathers HTML documents from the World Wide Web and mines out the documents users want by using a BP neural network model trained with the backpropagation algorithm. Data Mining (DM) and Knowledge Discovery in Databases (KDD) are defined as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Data mining is a new technology that arose in response to the problem of “Rich Data, Poor Information”. Network information mining is an application of data mining to the Internet: it extracts a latent pattern from target learning samples and then uses that pattern to extract useful information from Internet resources. The IDGS system consists of four modules: the Pattern Extraction and Feature Selection Module, the Raw Document Collection Module, the Pattern Matching Module, and the Document Database Module. It adopts a BP neural network model with the BP algorithm to match information content. The neural network that IDGS adopts has 20 input neurons, one output neuron, and 2 hidden layers. Each input neuron corresponds to one feature extracted from the learning samples, and the output neuron corresponds to the relevance to the mining target. The feature selection strategy is based on statistics: a word or phrase is selected as a feature if it appears more frequently in relevant documents than in irrelevant documents. To segment Chinese sentences and compute word frequencies, we set up three dictionaries: the Main dictionary, the Thesaurus dictionary, and the Implini dictionary. All the words in these three dictionaries are taken into account when computing word frequency, which addresses the problem of word diversity. Meanwhile, we set several weight coefficients, such as CofTitle, CofLinkText, CofH1, and CofH2, to make use of the markup text of HTML. Collecting raw d
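The statistical feature selection described in the abstract can be illustrated with a minimal sketch. It is not the IDGS implementation: the numeric weights, the mapping of CofTitle, CofLinkText, CofH1, and CofH2 to HTML tags, the ranking by frequency gap, and all function names are assumptions made for illustration. The abstract only states that a term is selected when it is more frequent in relevant than in irrelevant documents, and that the network has 20 inputs. Word segmentation and the three dictionaries are left out; documents are assumed to be already split into (tag, term) pairs.

```python
from collections import Counter

# Hypothetical weights. The abstract names the coefficients (CofTitle,
# CofLinkText, CofH1, CofH2) but does not give their values.
TAG_WEIGHTS = {"title": 3.0, "a": 2.0, "h1": 2.5, "h2": 2.0, "body": 1.0}


def weighted_frequencies(docs):
    """Return each term's tag-weighted frequency, averaged per document.

    Each document is a list of (tag, term) pairs, assumed to be already
    segmented into words or phrases.
    """
    totals = Counter()
    for doc in docs:
        for tag, term in doc:
            totals[term] += TAG_WEIGHTS.get(tag, 1.0)
    n = max(len(docs), 1)
    return {term: total / n for term, total in totals.items()}


def select_features(relevant_docs, irrelevant_docs, k=20):
    """Pick up to k terms that are more frequent in relevant documents."""
    rel = weighted_frequencies(relevant_docs)
    irr = weighted_frequencies(irrelevant_docs)
    # Keep only terms whose frequency in relevant documents is higher.
    gaps = {t: f - irr.get(t, 0.0) for t, f in rel.items() if f > irr.get(t, 0.0)}
    # Assumption: rank by the size of the gap; ties are broken alphabetically.
    ranked = sorted(gaps, key=lambda t: (-gaps[t], t))
    return ranked[:k]


def feature_vector(doc, features):
    """Map one document to a network input vector, one value per feature."""
    counts = Counter()
    for tag, term in doc:
        counts[term] += TAG_WEIGHTS.get(tag, 1.0)
    return [counts.get(f, 0.0) for f in features]
```

With the default `k=20`, the selected terms line up with the 20 input neurons mentioned in the abstract, and `feature_vector` would produce the values fed to those inputs.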

Keywords: neural network; WWW; data mining; IDGS; network information mining

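Classification: TP391 [Automation and Computer Technology: Computer Application Technology]; TP311.13 [Automation and Computer Technology: Computer Science and Technology]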

 
