Source: Chinese High Technology Letters (《高技术通讯》), 2010, No. 12, pp. 1224-1228 (5 pages)
Funding: Supported by the 973 Program (2007CB311004)
Abstract: Many natural language processing tasks, such as automatic speech recognition and intelligent input, rely on application-relevant text data, yet such data are often hard to collect in sufficient quantity and application-relevant training text is scarce. To address this difficulty, the paper presents a novel method for corpus adaptation that combines unsupervised text clustering with text retrieval. Using only a small set of application-specific text, the method finds further relevant text in a large unorganized corpus, thereby adapting the training corpus toward the application area of interest. The relevance of the acquired text was evaluated through the performance of an n-gram statistical language model trained on the retrieved text and tested on the application-specific text. Preliminary experiments on mobile phone short message texts and a large unorganized corpus showed that the method can effectively extract application-relevant text.
Keywords: text clustering; text retrieval; Kullback-Leibler distance; statistical language model
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
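The record lists Kullback-Leibler distance among the keywords but does not say how the paper applies it. As a rough illustration only, the sketch below ranks candidate documents by the KL divergence between a smoothed unigram distribution of a small application-specific seed text and that of each candidate. The function names, the add-one smoothing, and the toy data are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_dist(tokens, vocab):
    """Add-one smoothed unigram distribution over a fixed vocabulary (assumed smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def kl_divergence(p, q):
    """D(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def rank_by_relevance(seed_tokens, candidates):
    """Return candidate indices ordered from lowest to highest divergence from the seed."""
    # Shared vocabulary: every word seen in the seed or in any candidate.
    vocab = set(seed_tokens)
    for doc in candidates:
        vocab.update(doc)
    p = unigram_dist(seed_tokens, vocab)
    scores = [kl_divergence(p, unigram_dist(doc, vocab)) for doc in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i])


# Hypothetical toy data: a short-message-like seed and two candidate documents.
seed = "see you at lunch ok".split()
candidates = [
    "the committee approved the annual budget report".split(),
    "ok see you at lunch then".split(),
]
ranking = rank_by_relevance(seed, candidates)
```

Here `ranking` places the message-like candidate ahead of the formal one, since its word distribution diverges less from the seed. A real system would need a far larger vocabulary and, per the abstract, a clustering step before retrieval, which this sketch omits.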