Source: Computer Engineering (《计算机工程》), 2016, Issue 6, pp. 191-195 (5 pages)
Abstract: The lexicons of text segmentation systems do not include new words or compound words, even though such words are highly indicative of a document's topic. To address this, a keyword extraction algorithm based on the chi-square value of co-occurrence words is proposed. The dependency parser of the Language Technology Platform (LTP) is used to build association relations between words, from which co-occurrence words are extracted. A chi-square test is then applied to check whether the distribution of the co-occurrence words differs significantly. The larger the difference, the more likely the co-occurrence words are to be keywords; the algorithm applies equally to single words. Single words and co-occurrence words are both taken as candidate keywords, and full-text keywords are extracted by jointly considering each candidate's chi-square value, word frequency, and number of words. Experimental results show that the algorithm outperforms the TextRank algorithm: keyword extraction precision reaches 38.07%, and the accuracy of the co-occurrence words reaches 80.15%.
Keywords: dependency parsing; co-occurrence words; chi-square test; candidate keywords; significant difference
Classification number: TP301.6 [Automation and Computer Technology: Computer System Architecture]
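The abstract does not give the exact contingency table or scoring formula, so the following is only a minimal sketch of the general idea of scoring word pairs with a chi-square statistic. It makes two loud assumptions: co-occurrence is approximated by two words appearing in the same sentence (the paper instead derives word associations from LTP dependency parsing), and the statistic is the standard Pearson chi-square for a 2x2 table. The function names `chi_square_2x2` and `pair_chi_squares` are hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero marginal means the table carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def pair_chi_squares(sentences):
    """Score every word pair that shares at least one sentence.

    ASSUMPTION: sentence-level co-occurrence stands in for the
    dependency-based associations described in the abstract.
    `sentences` is a list of token lists; returns {(w1, w2): chi2}
    with each pair in alphabetical order.
    """
    n = len(sentences)
    word_df = Counter()  # number of sentences containing each word
    pair_df = Counter()  # number of sentences containing both words
    for tokens in sentences:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), a in pair_df.items():
        b = word_df[w1] - a  # sentences with w1 but not w2
        c = word_df[w2] - a  # sentences with w2 but not w1
        d = n - a - b - c    # sentences with neither word
        scores[(w1, w2)] = chi_square_2x2(a, b, c, d)
    return scores
```

For example, `pair_chi_squares([["chi", "square"], ["chi", "square"], ["text"], ["rank"]])` gives the pair `("chi", "square")` a score of 4.0, since the two words always appear together and never apart. How the paper combines this value with word frequency and word count is not stated in the abstract, so that step is deliberately left out.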