Building a Dynamic Language Model from Web Page Corpora  Cited by: 1

Updating language model based on training text from Webs


Authors: Li Xuetao (李雪涛)[1], Wen Maoping (文茂平)[1], Yang Jian (杨鉴)[1]

Affiliation: [1] School of Information, Yunnan University, Kunming 650091

Source: Information Technology (《信息技术》), 2006, No. 8, pp. 17-20 (4 pages)

Funding: National Natural Science Foundation of China (60265001)

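Abstract: Building a language model for a speech recognition system starts with corpus preparation, and the source of the corpus determines the performance of the language model. Web pages cover the most recent language phenomena and offer the most diverse resources for corpus preparation. However, the semantically complete strings in Web pages are usually interleaved with useless strings such as formatting, markup, and advertisements. This paper first introduces the training algorithm and the update method for language models, then proposes an algorithm that extracts from HTML documents the semantically complete Chinese character strings used to train a language model, and finally presents the corpus extraction results, the language model training results, and the dynamic update results. Together these provide a complete solution for dynamically updating a language model from Web page corpora.

A statistical n-gram language model (LM) predicts each language symbol in a sequence given its n-1 predecessors. The first stage of constructing an n-gram LM for a speech recognition system is gathering a training text set. The Web is a vast repository of information and a very important source of training text for updating an LM. However, HTML documents downloaded from the Web contain a lot of text that is redundant for LM training, such as formatting, tags, and advertisements. In this paper, a new algorithm that automatically extracts Chinese training text from HTML documents is introduced. Using this algorithm, a training text set of about 93 MB is extracted from the Web, and a baseline 3-gram LM is constructed from it. To verify that updating the LM from the Web is effective, another Web-derived training text set of about 14 MB is used to update the baseline LM. In addition, the perplexity and the OOV (out-of-vocabulary) rate of the baseline LM and the updated LM are estimated.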

Keywords: language model; corpus; information extraction; dynamic updating

Classification: TN912.3 [Electronics and Telecommunications: Communication and Information Systems]
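
The abstract describes extracting Chinese training text from HTML documents while discarding formatting, tags, and advertisements. The record does not give the paper's algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: it skips `script` and `style` content, keeps runs of characters from the basic CJK Unified Ideographs block, and drops runs shorter than a length threshold as a crude stand-in for filtering menu and advertisement fragments. The names `HanTextExtractor`, `extract_han_strings`, and `min_length` are hypothetical and introduced here for illustration.

```python
import re
from html.parser import HTMLParser

# Runs of characters in the basic CJK Unified Ideographs block.
# Assumption of this sketch: punctuation and other scripts end a run.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


class HanTextExtractor(HTMLParser):
    """Collects runs of Chinese characters from text nodes of an HTML page."""

    SKIPPED_TAGS = {"script", "style"}  # content that is never training text

    def __init__(self, min_length=5):
        super().__init__(convert_charrefs=True)
        self.min_length = min_length  # hypothetical threshold, not from the paper
        self.skip_depth = 0           # > 0 while inside a skipped element
        self.strings = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Keep only runs long enough to plausibly be part of a sentence.
        for run in HAN_RUN.findall(data):
            if len(run) >= self.min_length:
                self.strings.append(run)


def extract_han_strings(html, min_length=5):
    """Return the Chinese character runs found in an HTML string."""
    parser = HanTextExtractor(min_length)
    parser.feed(html)
    parser.close()
    return parser.strings
```

A length threshold is a deliberately simple heuristic: short link labels such as navigation items fall below it, while sentence fragments between punctuation marks usually do not. A real pipeline would likely need stronger criteria for what counts as a semantically complete string.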

 
