Combining text clustering and text retrieval for corpus adaptation


Authors: He Feng (何峰)[1], Ding Xiaoqing (丁晓青)[1]

Affiliation: [1] Department of Electronic Engineering, Tsinghua University, Beijing 100084

Source: Chinese High Technology Letters (《高技术通讯》), 2010, No. 12, pp. 1224-1228 (5 pages)

Funding: Supported by the 973 Program (grant 2007CB311004)

Abstract: Many natural language processing applications, such as automatic speech recognition and intelligent input, depend on application-relevant text data, yet sufficient relevant data is often hard to collect and application-specific training text is scarce. To address this difficulty, this paper presents a novel method for corpus adaptation that combines unsupervised text clustering with text retrieval. Using only a small set of application-specific text, the method finds more relevant text in a large, unorganized corpus, thereby adapting the training corpus toward the application area of interest. The relevance of the acquired text was evaluated by the performance of an n-gram statistical language model trained on the retrieved text and tested on the application-specific text. Preliminary experiments on mobile phone short message texts and a large unorganized corpus showed that the proposed method extracts application-relevant text effectively.

Keywords: text clustering; text retrieval; Kullback-Leibler distance; statistical language model

Classification code: TP391.1 [Automation and Computer Technology - Computer Application Technology]
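
The record names the Kullback-Leibler distance as a keyword but does not say how the paper applies it. As an illustration only, the sketch below assumes that candidate clusters from the large corpus are ranked by the KL divergence between a smoothed unigram distribution of the small application-specific seed set and that of each cluster. The function names (`unigram_dist`, `kl_divergence`, `rank_clusters`), the add-alpha smoothing, and the whitespace tokenization are all assumptions rather than details from the paper; Chinese text would need a word segmenter first.

```python
import math
from collections import Counter


def unigram_dist(texts, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary (assumed smoothing)."""
    counts = Counter(tok for t in texts for tok in t.split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must be defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def rank_clusters(seed_texts, clusters):
    """Return cluster ids ordered from lowest to highest KL divergence from the seed texts."""
    # Build one shared vocabulary so every distribution has the same support.
    vocab = {tok for t in seed_texts for tok in t.split()}
    for docs in clusters.values():
        vocab.update(tok for t in docs for tok in t.split())
    vocab = sorted(vocab)

    p_seed = unigram_dist(seed_texts, vocab)
    scores = {
        cid: kl_divergence(p_seed, unigram_dist(docs, vocab))
        for cid, docs in clusters.items()
    }
    return sorted(scores, key=scores.get)


# Toy data, invented for illustration: a short-message-like seed set and two clusters.
seed = ["see you at eight tonight", "call me when you arrive"]
clusters = {
    "chat": ["call me tonight", "see you when you arrive"],
    "finance": ["the stock market fell sharply", "bond yields rose again"],
}
ranking = rank_clusters(seed, clusters)
print(ranking)
```

In a pipeline of this kind, the lowest-divergence clusters would supply the additional training text, and an n-gram language model trained on them could then be tested on held-out application-specific text, which is the evaluation the abstract describes.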

 
