Research on Method for Chinese Word Segmentation without Thesaurus in Chinese Biomedical Text (中文生物医学文本无词典分词方法研究)  Cited by: 4

Authors: Wang Junhui (王军辉)[1], Hu Tiejun (胡铁军)[1], Li Danya (李丹亚)[1], Qian Qing (钱庆)[1], Fang An (方安)[1]

Affiliation: [1] Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2011, No. 2, pp. 197-203 (7 pages)

Abstract: To segment Chinese biomedical text effectively without a dictionary, this paper introduces a dictionary-free word segmentation method based on the recurrence principle, chosen to suit the characteristics of such text: abundant specialized terminology, a steady stream of new terms, and structured abstracts. The method is improved in practical application in two respects: no upper limit is set on term length, and terms and hierarchical terms are extracted in a single pass. Experimental results show that, without a dictionary or corpus-based learning, the method effectively extracts the key specialized terms in biomedical text, with a segmentation accuracy of about 84.51%. Finally, based on these segmentation results, the word-length distribution in the biomedical domain is examined in a preliminary way; the results indicate that it differs markedly from that of general Chinese text. The findings offer a reference for choosing the value of N in N-gram models when processing Chinese biomedical text.
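The recurrence idea summarized in the abstract can be illustrated with a rough sketch. This is not the authors' algorithm, whose details the abstract does not give. It assumes a term is any character string that recurs at least `min_count` times between punctuation marks, and that a shorter string is kept as a hierarchical term only when it recurs more often than every longer candidate containing it. The names `repeated_terms` and `length_distribution` and the parameters `min_count` and `min_len` are hypothetical.

```python
import re
from collections import Counter

# Punctuation and whitespace are treated as hard boundaries (an assumption).
PUNCT = re.compile(r"[，。、；：？！,.;:?!\s]+")

def repeated_terms(text, min_count=2, min_len=2):
    """Return {term: count} for recurring strings, with no preset length cap."""
    segments = [s for s in PUNCT.split(text) if s]
    counts = {}
    n = min_len
    while True:
        grams = Counter(
            seg[i:i + n] for seg in segments for i in range(len(seg) - n + 1)
        )
        repeated = {g: c for g, c in grams.items() if c >= min_count}
        if not repeated:
            # No string of length n recurs, so no longer string can either.
            break
        counts.update(repeated)
        n += 1
    # Drop a candidate when a longer candidate contains it and recurs equally
    # often: it never appears on its own. Shorter strings that recur more
    # often survive as hierarchical terms.
    return {
        g: c for g, c in counts.items()
        if not any(len(h) > len(g) and g in h and counts[h] == c for h in counts)
    }

def length_distribution(terms):
    """Count how many extracted terms there are of each character length."""
    return Counter(len(t) for t in terms)

sample = "糖尿病肾病的诊断。糖尿病患者常并发糖尿病肾病。"
terms = repeated_terms(sample)
print(terms)
print(length_distribution(terms))
```

On this two-sentence sample the sketch keeps both the longer term and the shorter term nested inside it, because the shorter one also occurs independently. A real corpus would need further filtering, for example of strings ending in function words.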

Keywords: dictionary-free word segmentation; structured abstract; biomedical text

Classification code: G35 [Cultural Sciences: Information Science]

 
