Authors: WANG Hong (王宏) [1,2]; ZHU Xue-li (朱学立); ZENG Tao (曾涛) [1,2]; QIAO Dong-yu (乔东玉); GUO Jia-teng (郭甲腾) [3]
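Affiliations: [1] Henan Institute of Geological Survey; [2] Henan Key Laboratory for Metalogenetic Process of Metal Mineral Resource and Resource Utilization, Zhengzhou 450000, Henan, China; [3] School of Resources and Civil Engineering, Northeastern University, Shenyang 110000, Liaoning, China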
Source: Software Guide (《软件导刊》), 2020, No. 4, pp. 211-218 (8 pages)
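Funding: National Natural Science Foundation of China (41671404); Fundamental Research Funds for the Central Universities (N170104019); China Geological Survey project on construction of an intelligent geological survey support platform (DD20160355).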
Abstract: Chinese word segmentation is the unavoidable first step in intelligent knowledge mining from geological big data. Statistical segmentation methods depend on their training corpus and adapt poorly across domains. Dictionary-based methods can use a domain dictionary directly but cannot recognize unlisted (out-of-vocabulary) words. To improve segmentation accuracy and the recognition rate of unlisted words in geological text when domain corpora are scarce, this paper proposes a statistics-based method for recognizing Chinese geological words. The method builds a basic geological dictionary based on the idea of prime strings, which improves the adaptability of statistical segmentation to geological text. A repeated-string search yields a candidate set of geological words, which is then filtered using context adjacency and a position-based word-formation probability dictionary to produce the final recognized words. Experimental results show that the method recognizes geological terms with an accuracy of 81.6%, nearly 60% higher than a general statistical word segmentation method. The method can identify unlisted words in geological text while maintaining segmentation accuracy, and can be applied to geological text segmentation.
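Keywords: geological text; Chinese word segmentation; prime string; repeated string; context adjacency; position-based word-formation probability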
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
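Illustration: The abstract names two steps that lend themselves to a small example: finding repeated strings as word candidates, and filtering them by context adjacency. The Python sketch below is only a toy reading of those two ideas, not the authors' implementation. The function names, the length and count thresholds, and the use of "number of distinct neighbouring characters" as the adjacency measure are all assumptions made for illustration. The prime-string dictionary and the position-based word-formation probability filter are omitted because the record gives no details about them.

```python
def repeated_strings(text, min_len=2, max_len=4, min_count=2):
    """Collect substrings of length min_len..max_len that occur at least
    min_count times, together with their left and right neighbours.
    (Hypothetical helper; thresholds are illustrative assumptions.)"""
    stats = {}
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            s = text[i:i + n]
            entry = stats.setdefault(s, {"count": 0, "left": set(), "right": set()})
            entry["count"] += 1
            if i > 0:
                entry["left"].add(text[i - 1])      # character just before s
            if i + n < len(text):
                entry["right"].add(text[i + n])     # character just after s
    return {s: e for s, e in stats.items() if e["count"] >= min_count}


def filter_by_adjacency(candidates, min_variety=2):
    """Keep candidates whose left AND right neighbours both vary.
    A fragment of a longer word (e.g. a prefix) always has the same
    neighbour on one side, so it is dropped.
    (Assumed adjacency measure: count of distinct neighbours.)"""
    return sorted(
        s for s, e in candidates.items()
        if len(e["left"]) >= min_variety and len(e["right"]) >= min_variety
    )


# Toy input: the word 花岗岩 (granite) appears three times in different contexts.
toy = "甲花岗岩乙丙花岗岩丁戊花岗岩己"
cands = repeated_strings(toy, min_len=2, max_len=3, min_count=2)
print(sorted(cands))                # repeated candidates, including fragments
print(filter_by_adjacency(cands))   # ['花岗岩']
```

In this toy input the fragments 花岗 and 岗岩 repeat as often as 花岗岩, but each always has the same character on one side, so only the full word survives the adjacency filter.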