自动提取含字母词语的领域新术语的研究  被引量:3

Research on Automatic Extraction of Chinese New Domain-specific Terms Comprising Lettered-words

在线阅读下载全文

作  者:姜韶华[1] 党延忠[1] 

机构地区:[1]大连理工大学系统工程研究所,大连116024

出  处:《计算机工程》2007年第2期47-49,共3页Computer Engineering

基  金:国家自然科学基金资助项目(70271046)

摘  要:新术语的提取是中文信息处理领域的一个重要研究课题。针对现有提取方法的不足和很多专业术语表现为字母词语的特点,该文提出了一种综合统计技术和规则筛选的方法:基于长串优先和串频统计的思路进行文本切分,得到共现字符串,利用词语搭配规则进行过滤,经过领域词典及评价函数的筛选,提取出领域新术语。该方法可发现包含字母词语、专业术语等未登录词在内的频率大于等于2的任意长度的专指语义串、短语和词。实验表明了该方法的有效性及新术语的准确率分布特征。Extraction of new domain-specific terms is one of the important topics in Chinese natural language processing. Aiming at the limitation of the current methods and the specialties of many domain-specific terms are lettered-words, a novel approach combined with statistic technique and rule is proposed to extract new special semantic strings. Co-occurrence of character strings is formed by text segmentation based on matching longer strings first combined with frequency statistics. No-meaningful character strings are trimmed by collocation rules. Filtered by domain lexicon and membership degree, new domain-specific terms are extracted finally. This method can extract new special semantic strings, phrases and words, including unknown words like lettered-words and domain-specific terms, their frequency is larger than 2. Experiments show that this extraction technique is effective and indicate new domain-specific terms' distribution characteristic of precision ratio.

关 键 词:专指语义串 长串优先 字母词语 中文信息处理 

分 类 号:TP311[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象