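Authors: Wang Junhui (王军辉)[1], Hu Tiejun (胡铁军)[1], Li Danya (李丹亚)[1], Qian Qing (钱庆)[1], Fang An (方安)[1]
Affiliation: [1] Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2011, No. 2, pp. 197-203 (7 pages)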
Abstract: To segment Chinese biomedical text effectively without a dictionary, this paper takes into account three characteristics of such text (a high density of specialized terms, the continual emergence of new terms, and structured abstracts) and introduces a dictionary-free word segmentation method based on the recurrence principle. In practical application the method is improved in two respects: the upper limit on term length, where no upper limit is set, and the extraction of terms and hierarchical terms in a single pass. Experimental results show that, without a dictionary or corpus training, the method effectively extracts the key specialized terms in biomedical text, with a segmentation accuracy of about 84.51%. Finally, based on the segmentation results, the paper makes a preliminary study of word length distribution in the biomedical domain; the results show that the word length distribution of Chinese biomedical text differs greatly from that of general Chinese text. The findings offer a reference for choosing the value of N in N-gram models when processing Chinese biomedical text.
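The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea behind recurrence-based extraction, not the authors' method. It assumes that a candidate term is any character string that recurs within punctuation-delimited clauses, places no upper bound on candidate length, and keeps a shorter string only when it occurs more often than every longer repeated string containing it (which yields nested, hierarchical terms). The function names `repeated_substrings`, `maximal_terms`, and `length_distribution`, and the sample text, are hypothetical.

```python
import re
from collections import Counter


def repeated_substrings(text, min_count=2, min_len=2):
    """Count substrings of length >= min_len that occur at least min_count times.

    Substrings never cross punctuation or whitespace, and there is
    no upper bound on their length (an assumption of this sketch).
    """
    # Split into clauses on any run of non-word characters (punctuation, spaces).
    clauses = [c for c in re.split(r"[^\w]+", text) if c]
    counts = Counter()
    for clause in clauses:
        n = len(clause)
        for i in range(n):
            for j in range(i + min_len, n + 1):
                counts[clause[i:j]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def maximal_terms(repeated):
    """Drop a string when a longer repeated string contains it with the same count."""
    kept = {}
    for s, c in repeated.items():
        subsumed = any(
            len(t) > len(s) and s in t and repeated[t] == c for t in repeated
        )
        if not subsumed:
            kept[s] = c
    return kept


def length_distribution(terms):
    """Map term length (in characters) to the number of distinct terms of that length."""
    return Counter(len(t) for t in terms)


# Hypothetical sample text, not taken from the paper.
sample = "高血压患者治疗。高血压患者预后，糖尿病与高血压。"
terms = maximal_terms(repeated_substrings(sample))
lengths = length_distribution(terms)
```

On this sample the sketch keeps both 高血压患者 and the shorter 高血压 nested inside it, because the shorter string recurs more often than the longer one. The length counts illustrate the kind of distribution that could inform the choice of N in an N-gram model.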