一种基于字同现频率的汉语文本主题抽取方法  Cited by: 48

A Novel Chinese Text Subject Extraction Method Based on Character Co-occurrence

Authors: Ma Yinghua (马颖华)[1], Wang Yongcheng (王永成)[1], Su Guiyang (苏贵洋)[1], Zhang Yumeng (张宇萌)[1]

Affiliation: [1] Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200030

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2003, No. 6, pp. 874-878 (5 pages)

Funding: National Natural Science Foundation of China (60082003)

Abstract: Subject extraction is one of the fundamental tasks of automatic text processing, and it has conventionally begun with word segmentation or word extraction. Because Chinese text has no explicit delimiters between words, segmentation and word extraction often perform poorly, which in turn limits the quality of subject extraction. This paper proposes a new method for automatic subject extraction from Chinese text that takes the character as the processing unit and relies on character co-occurrence frequency. The method is fast, adapts to multiple text genres, and avoids word segmentation and word extraction entirely. It can be applied at several levels of subject extraction, such as subject sentences and subject paragraphs, and it is equally applicable to text in other languages and even to multi-language text. Subject sentence extraction experiments show an accuracy of 77.19% on news text of multiple styles. Comparative experiments on Chinese text further show that accuracy does not decline when word segmentation is omitted.
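
The abstract does not give the scoring formula, so the following is only a minimal illustrative sketch of how character co-occurrence could rank candidate subject sentences without any word segmentation. It is not the authors' algorithm. The function names, the choice of unordered character pairs within a sentence as the co-occurrence unit, and the average-pair-frequency score are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Split on common Chinese and Western sentence-ending punctuation.
    parts = re.split(r"[。！？!?.\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def char_pairs(sentence):
    # Assumption: two characters "co-occur" if both appear in the same sentence.
    # Punctuation and whitespace are skipped; no word segmentation is performed.
    chars = sorted({c for c in sentence if c.isalnum()})
    return {(a, b) for i, a in enumerate(chars) for b in chars[i + 1:]}


def rank_sentences(text):
    sentences = split_sentences(text)
    pair_sets = [char_pairs(s) for s in sentences]

    # Count in how many sentences each character pair co-occurs.
    pair_freq = Counter()
    for pairs in pair_sets:
        pair_freq.update(pairs)

    # Hypothetical score: mean document-level frequency of a sentence's pairs.
    scored = []
    for sentence, pairs in zip(sentences, pair_sets):
        score = sum(pair_freq[p] for p in pairs) / len(pairs) if pairs else 0.0
        scored.append((score, sentence))
    return sorted(scored, key=lambda item: item[0], reverse=True)


def extract_subject_sentence(text):
    ranked = rank_sentences(text)
    return ranked[0][1] if ranked else ""


if __name__ == "__main__":
    sample = "主题抽取是基础工作。主题抽取避开分词。今天天气很好。"
    print(extract_subject_sentence(sample))
```

Because the sketch works on individual characters, the same code runs unchanged on text in other scripts, which mirrors the language independence claimed in the abstract. A real system would need a weighting scheme validated against data.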

Keywords: natural language processing; subject extraction; co-occurrence frequency

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
