Affiliation: [1] Department of Automation, Tsinghua University, Beijing 100084
Source: Journal of Tsinghua University (Science and Technology), 2005, No. 1, pp. 115-118 (4 pages)
Abstract: This paper presents a time-interval-based method for session identification in web log mining. The method splits a user's access log into distinct sessions wherever the interval between adjacent page accesses exceeds a threshold. The threshold for a specific IP is defined from the statistics of its frequency vector. Experiments show that the frequency vectors of proxy-server IPs and single-user IPs have different characteristics: for a proxy IP the frequency vector follows a power-law distribution, whereas for a single-user IP it approximates a Gaussian distribution. On this basis, a method based on the Gaussian hypothesis is proposed for setting a separate threshold for each single-user IP. Compared with the traditional approach, which applies a single a priori threshold to all IP addresses, the proposed method is more reasonable and effective.
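Keywords: database theory; web log mining; session identification; time interval; frequency vector
Classification: TP311.131 [Automation and Computer Technology: Computer Software and Theory]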
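The splitting step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's threshold formula or define how the frequency vector is built, so the rule used here (mean plus k standard deviations of one IP's access intervals) is only an assumed stand-in for a Gaussian-based threshold. The function names, the parameter `k`, and the 30-minute fallback value are all hypothetical and do not come from the paper.

```python
from statistics import mean, stdev


def gaussian_threshold(intervals, k=2.0, default=1800.0):
    """Assumed rule (not from the paper): mean + k * std of one IP's
    access intervals, in seconds. Falls back to `default` when there
    are too few intervals to estimate a spread."""
    if len(intervals) < 2:
        return default
    return mean(intervals) + k * stdev(intervals)


def split_sessions(timestamps, threshold):
    """Cut a sorted list of access times wherever the gap between
    adjacent accesses exceeds `threshold`."""
    sessions = []
    current = []
    for t in timestamps:
        if current and t - current[-1] > threshold:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions


def sessions_for_ip(timestamps, k=2.0, default=1800.0):
    """Derive a per-IP threshold from that IP's own intervals, then split."""
    ts = sorted(timestamps)
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    return split_sessions(ts, gaussian_threshold(intervals, k, default))
```

For example, `split_sessions([0, 10, 20, 1000, 1010], 100)` separates the first three accesses from the last two. With so few samples a single long gap inflates the standard deviation, so the choice of `k` matters a great deal in this sketch; the abstract limits the Gaussian assumption to single-user IPs, and this sketch does not handle proxy IPs.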