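Affiliations: [1] School of Software, Changsha Social Work College, Changsha 410082 [2] School of Information Science and Engineering, Central South University, Changsha 410000
Source: Application Research of Computers (《计算机应用研究》), 2013, No. 12, pp. 3610-3613 (4 pages)
Funding: Doctoral Program New Teacher Fund of the Ministry of Education of China (20090162120087); Hunan Provincial Science and Technology Plan Project (2009FJ3053)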
Abstract: Key phrases extracted by the traditional TF*PDF method can describe a topic accurately and support the tracking of news reports, but the method sometimes misidentifies noise data as key phrases. This paper proposes a two-stage key phrase extraction method based on position-weighted TF*PDF to filter out such noise. In the first stage, the method combines the traditional TF*PDF algorithm with position weights to compute the weights of words and phrases and obtain a candidate list, then uses burst values to filter noise from that list. In the second stage, a phrase identification process combines hot terms into phrases using position information, frequency information, and similar cues. The position-weighted TF*PDF algorithm is also used to assign weights to the phrases, and the top K phrases are taken as hot key phrases. Experiments on real Web data indicate that, compared with traditional TF*PDF extraction, the method removes absolute noise from the key phrases more effectively and improves the accuracy of hot topic detection.
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
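The weighting step described in the abstract can be illustrated with a minimal sketch. It assumes the commonly cited TF*PDF formulation, in which a term's weight is the sum over channels of its normalized frequency times exp(n/N), where n is the number of documents in the channel containing the term and N is the channel's document count. The position weight here is a hypothetical stand-in: title occurrences are simply counted `title_weight` times. This record does not give the paper's actual position-weight scheme, burst-value filter, or phrase identification step, so none of those are reproduced. The names `position_weighted_tf_pdf`, `top_k`, and `title_weight` are illustrative only.

```python
import math
from collections import defaultdict


def position_weighted_tf_pdf(channels, title_weight=3.0):
    """Score terms with TF*PDF, boosting title occurrences.

    channels: dict mapping a channel name to a list of documents; each
    document is a dict with token lists under "title" and "body".
    title_weight is a hypothetical multiplier, not the paper's value.
    """
    scores = defaultdict(float)
    for docs in channels.values():
        if not docs:
            continue
        freq = defaultdict(float)     # position-weighted term frequency in this channel
        doc_count = defaultdict(int)  # number of documents in this channel containing the term
        for doc in docs:
            for term in doc["title"]:
                freq[term] += title_weight
            for term in doc["body"]:
                freq[term] += 1.0
            for term in set(doc["title"]) | set(doc["body"]):
                doc_count[term] += 1
        # Normalize frequencies by the channel's Euclidean norm.
        norm = math.sqrt(sum(f * f for f in freq.values()))
        if norm == 0:
            continue
        for term, f in freq.items():
            # PDF factor: exponential of the term's document share in the channel.
            scores[term] += (f / norm) * math.exp(doc_count[term] / len(docs))
    return dict(scores)


def top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


channels = {
    "site_a": [
        {"title": ["quake"], "body": ["quake", "rescue"]},
        {"title": ["quake"], "body": ["aid"]},
    ],
    "site_b": [
        {"title": ["election"], "body": ["quake", "vote"]},
    ],
}
scores = position_weighted_tf_pdf(channels)
hot_terms = top_k(scores, 2)
```

In this toy input, a term that appears across channels and in titles outranks one confined to a single document, which is the behavior the abstract attributes to the weighting stage.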