Hot Query Content Extraction in Search Engine Logs  Cited by: 1


Authors: Ren Yuwei (任育伟), Lü Xueqiang (吕学强) [1], Li Zhuo (李卓) [1], Xu Liping (徐丽萍) [2]

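Affiliations: [1] Key Laboratory of Internet Culture and Digital Dissemination, Beijing Information Science and Technology University, Beijing 100101; [2] Beijing Research Center of Urban System Engineering, Beijing 100089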

Source: Computer Applications and Software (《计算机应用与软件》), 2015, No. 12, pp. 16-21 (6 pages)

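Funding: National Natural Science Foundation of China (61271304); Key Project of the Science and Technology Development Program of the Beijing Municipal Education Commission and Class B Key Project of the Beijing Natural Science Foundation (KZ201311232037)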

Abstract: Search engine logs contain a massive amount of information, and mining them to analyse hot query content is of considerable value for improving the quality of search services. Combining the iterative cluster-centre updating of K-means with information on query term vector length, we propose SKHC (similar K-means hierarchical clustering) and use it to cluster the search logs. We then analyse how the number of querying users, the query frequency, the cumulative query time, the number of queries, and the statistical features of each cluster relate to hot queries, and propose a method that extracts hot query content from the heat value of each cluster, integrating the log heat value with an inverse log frequency statistic. Statistical analysis of the extracted results, compared against the hot events that occurred in the months covered by the logs, shows that the Sichuan earthquake and the Beijing Olympics reached the highest monthly average heat values, 0.89 and 0.81 respectively, demonstrating the effectiveness of the method.
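The abstract names the ingredients of the heat score (per-cluster user count, query frequency, cumulative query time, query count, and an inverse log frequency term) but not the formula that combines them. The following Python sketch shows one way such a score could be assembled, by analogy with TF-IDF. It is not the paper's formula: the function name `heat_scores`, the equal weights, the max-normalisation, the smoothed logarithm, and the toy numbers are all assumptions made for illustration.

```python
import math

def heat_scores(logs, weights=(0.25, 0.25, 0.25, 0.25)):
    """Hypothetical heat score, NOT the formula from the paper.

    logs maps log_id -> {cluster_id: (users, frequency, seconds, queries)}.
    Returns log_id -> {cluster_id: heat}.
    """
    n_logs = len(logs)

    # Count in how many logs each cluster occurs (document-frequency analogue).
    log_freq = {}
    for clusters in logs.values():
        for cid in clusters:
            log_freq[cid] = log_freq.get(cid, 0) + 1

    scores = {}
    for log_id, clusters in logs.items():
        scores[log_id] = {}
        if not clusters:
            continue
        # Per-feature maxima within this log, for max-normalisation to [0, 1].
        maxima = [max(stats[i] for stats in clusters.values()) or 1
                  for i in range(4)]
        for cid, stats in clusters.items():
            # Weighted sum of the four normalised statistics (assumed form).
            local = sum(w * s / m for w, s, m in zip(weights, stats, maxima))
            # Smoothed inverse log frequency (assumed form): clusters that
            # appear in fewer logs receive a larger multiplier.
            ilf = math.log(1 + n_logs / log_freq[cid])
            scores[log_id][cid] = local * ilf
    return scores

# Toy input with made-up numbers, one entry per monthly log.
sample = {
    "2008-05": {"earthquake": (900, 5000, 36000, 40),
                "weather": (300, 1000, 5000, 10)},
    "2008-06": {"weather": (280, 900, 4800, 9)},
}
scores = heat_scores(sample)
ranked = sorted(scores["2008-05"], key=scores["2008-05"].get, reverse=True)
print(ranked)
```

In this sketch a cluster that dominates one month's log and is absent from the others ranks above a cluster that recurs every month, which is the behaviour an inverse-frequency term is meant to produce. Unlike the values reported in the abstract, these scores are not bounded to [0, 1].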

Keywords: search logs; clustering; hot queries; heat

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
