A Method for Generating Domain Lexicons from Raw Corpora  Cited by: 11

Method of Special Domain Lexicon Construction Based on Raw Materials

Authors: Sun Xia (孙霞)[1], Zheng Qinghua (郑庆华)[1], Wang Chaojing (王朝静)[1], Zhang Sujuan (张素娟)[1]

Affiliation: [1] Department of Computer Science, Xi'an Jiaotong University, Xi'an, Shaanxi 710049

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2005, No. 6, pp. 1088-1092 (5 pages)

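Funding: Supported by the National Natural Science Foundation of China (60373105); the National "Tenth Five-Year Plan" Key Science and Technology Project (2001BA101A01); and the Ministry of Education's Excellent Young Teachers Fund.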

Abstract: To segment words accurately, any practical Chinese information processing system needs its own domain-specific lexicon. To address the shortcomings of existing lexicon construction methods, this paper proposes a method for constructing a domain lexicon. A general-purpose lexicon is used to segment the raw domain corpus, and a maximum matching algorithm based on segmentation units is proposed to obtain the set of candidate word strings. Rules are then applied to optimize this set, and the domain lexicon is finally generated. The generation process is largely automatic, requires little manual intervention, and is easy to update. The domain lexicon generated by this method has already been applied in our self-developed "Web-based intelligent question answering system", with good results.

A domain-specific lexicon is vital to any practical Chinese information processing system, especially for Chinese word segmentation. To address the limitations of current methods for constructing such lexicons, this paper proposes a new construction approach aimed at word segmentation. Starting from a large collection of raw materials gathered in advance for a given domain, each document is first segmented with an open-domain lexicon, and the longest repeated string patterns are then extracted from it. Non-meaningful words are trimmed from the candidate word set to improve extraction accuracy, optimization rules filter the remaining non-meaningful words, and the domain lexicon is finally constructed. The method has been implemented and applied in our Web answering system. Experimental results show that it is practical, effective, and extensible.
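
The pipeline outlined in the abstract (segment with a general lexicon, collect repeated runs of segmentation units as candidates, then filter by rule) can be illustrated with a minimal sketch. The abstract does not specify the details of the paper's "maximum matching algorithm based on segmentation units" or its optimization rules, so the code below substitutes a plain forward maximum matching segmenter, a frequency threshold, and a stop-unit filter as assumed stand-ins. The function names, parameters, toy lexicon, and sample sentences are all hypothetical.

```python
from collections import Counter


def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    units = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback unit.
            if size == 1 or piece in lexicon:
                units.append(piece)
                i += size
                break
    return units


def _contains(longer, shorter):
    """True if the tuple `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[k:k + n] == shorter for k in range(len(longer) - n + 1))


def candidate_strings(sentences, lexicon, stop_units=frozenset(),
                      max_units=3, min_freq=2):
    """Collect repeated runs of adjacent segmentation units as candidate words.

    Assumed rules (not taken from the paper): runs containing a stop unit are
    skipped, runs seen fewer than `min_freq` times are dropped, and a run is
    dropped when a longer run with the same frequency contains it.
    """
    counts = Counter()
    for sentence in sentences:
        units = forward_max_match(sentence, lexicon)
        for n in range(2, max_units + 1):
            for start in range(len(units) - n + 1):
                run = tuple(units[start:start + n])
                if not any(u in stop_units for u in run):
                    counts[run] += 1

    frequent = {run: c for run, c in counts.items() if c >= min_freq}
    result = {}
    for run, c in frequent.items():
        subsumed = any(
            len(other) > len(run) and oc == c and _contains(other, run)
            for other, oc in frequent.items()
        )
        if not subsumed:
            result["".join(run)] = c
    return result


# Toy general-purpose lexicon and raw domain sentences (illustrative only).
general_lexicon = {"操作", "系统", "进程", "调度", "核心"}
raw_sentences = ["操作系统的进程调度", "进程调度是操作系统的核心"]

candidates = candidate_strings(raw_sentences, general_lexicon,
                               stop_units={"的", "是"})
print(candidates)
```

On this toy input the general lexicon splits 操作系统 and 进程调度 into two units each, and the repeated adjacent pairs are recovered as domain word candidates. A real system would need a much larger corpus and more careful filtering rules.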

Keywords: domain lexicon; general-purpose lexicon; word frequency statistics; maximum matching

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
