Session identification based on time intervals in Web log mining  (Cited by: 24)

Authors: Zhuang Like (庄力可)[1], Kou Zhongbao (寇忠宝)[1], Zhang Changshui (张长水)[1]

Affiliation: [1] Department of Automation, Tsinghua University, Beijing 100084

Source: Journal of Tsinghua University (Science and Technology), 2005, No. 1, pp. 115-118 (4 pages)

Abstract: This paper presents a time-interval-based method for session identification in Web log mining. The method splits the access log into distinct sessions wherever the interval between consecutive page requests exceeds a threshold, and the threshold for a specific IP is defined from the statistics of its frequency vector. Experiments show that the frequency vectors of proxy-server IPs and single-user IPs have different characteristics: for a proxy IP the frequency vector often shows a power-law distribution, whereas for a single-user IP it approximates a Gaussian distribution. On this basis, a method based on the Gaussian hypothesis is proposed for computing a separate threshold for each single-user IP. Compared with the traditional approach, which applies a single a priori threshold to all IP addresses, the proposed method is more reasonable and effective.
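
The splitting rule described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the definition of the frequency vector or the exact threshold formula, so the sketch substitutes a stand-in: it assumes each IP's inter-request gaps are roughly Gaussian and sets the threshold to the mean plus `k` standard deviations. The names `gaussian_threshold`, `split_sessions`, and `sessionize`, the parameter `k`, and the 1800-second fallback are all hypothetical and are not taken from the paper. The sketch also makes no attempt to detect proxy IPs, for which the abstract reports a power-law rather than a Gaussian pattern.

```python
from collections import defaultdict
from statistics import mean, pstdev


def gaussian_threshold(timestamps, k=2.0, fallback=1800.0):
    """Hypothetical per-IP threshold: mean + k * std of inter-request gaps.

    This is an assumed stand-in, not the formula used in the paper.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        # Too little data to estimate a spread; use a fixed fallback (assumption).
        return fallback
    return mean(gaps) + k * pstdev(gaps)


def split_sessions(timestamps, threshold):
    """Cut a sorted list of timestamps wherever a gap exceeds the threshold."""
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


def sessionize(records, k=2.0):
    """Group (ip, timestamp) records by IP and split each IP's requests."""
    by_ip = defaultdict(list)
    for ip, ts in records:
        by_ip[ip].append(ts)
    result = {}
    for ip, stamps in by_ip.items():
        stamps.sort()
        result[ip] = split_sessions(stamps, gaussian_threshold(stamps, k))
    return result
```

Calling `sessionize` on a list of `(ip, timestamp_in_seconds)` pairs returns, for each IP, a list of sessions, each session being a list of timestamps. Passing a fixed value to `split_sessions` for every IP corresponds to the traditional single-threshold approach that the abstract compares against.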

Keywords: database theory; Web log mining; session identification; time interval; frequency vector

Classification number: TP311.131 [Automation and Computer Technology: Computer Software and Theory]

 
