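Affiliation: [1] School of Information, Yunnan University, Kunming 650091
Source: Information Technology (《信息技术》), 2006, No. 8, pp. 17-20 (4 pages)
Funding: National Natural Science Foundation of China (Grant No. 60265001)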
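Abstract: Building a language model for a speech recognition system starts with corpus preparation, and the source of the corpus determines the model's performance. Web pages cover the latest language phenomena and offer the most diverse resource for corpus preparation, but semantically complete strings in Web pages are usually mixed with useless strings such as formatting, markup, and advertisements. This paper first introduces the training algorithm and update method for the language model, then proposes an algorithm for extracting semantically complete Chinese character strings from HTML documents for language model training, and finally reports the corpus extraction results, the language model training results, and the results of dynamically updating the language model. Together these provide a complete solution for dynamically updating a language model from Web page corpora.

A statistical n-gram language model (LM) predicts each language symbol in a sequence given its n-1 predecessors. The first stage of constructing an n-gram LM for a speech recognition system is gathering a training text set. The Web is a vast repository of information and a very important source of training text for updating an LM. However, HTML documents downloaded from the Web contain a lot of text that is redundant for LM training, such as formatting, tags, and advertisements. In this paper, a new algorithm that automatically extracts Chinese training text from HTML documents is introduced. Using this algorithm, a training text set of about 93 MB is extracted from the Web, and a baseline 3-gram LM is constructed from it. To verify that updating the LM from the Web is effective, another training text set of about 14 MB, also from the Web, is used to update the baseline LM. In addition, the perplexity and the OOV (out-of-vocabulary) rate of the baseline LM and the updated LM are estimated.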
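The abstract does not spell out the extraction algorithm, so the following is only a minimal sketch of the general idea it describes: pulling runs of Chinese characters out of an HTML document while discarding tags, scripts, and short fragments such as link or advertisement labels. The class name `ChineseTextExtractor`, the function `extract_chinese_strings`, and the `min_len` threshold are illustrative assumptions and are not taken from the paper.

```python
import re
from html.parser import HTMLParser

# Runs of CJK Unified Ideographs. Punctuation and Latin text split a run.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


class ChineseTextExtractor(HTMLParser):
    """Collect Chinese character runs from the visible text of an HTML page.

    Hypothetical helper: the length filter is an assumed heuristic for
    dropping short fragments (menus, ad labels), not the paper's rule.
    """

    SKIP_TAGS = {"script", "style"}

    def __init__(self, min_len=6):
        super().__init__(convert_charrefs=True)
        self.min_len = min_len
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.strings = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and stylesheet content
        for run in CJK_RUN.findall(data):
            if len(run) >= self.min_len:
                self.strings.append(run)


def extract_chinese_strings(html, min_len=6):
    """Return Chinese character runs of at least min_len characters."""
    parser = ChineseTextExtractor(min_len)
    parser.feed(html)
    parser.close()
    return parser.strings
```

Calling `extract_chinese_strings(page_source)` on a downloaded page returns a list of candidate training strings. A real pipeline would also need character-encoding detection and word segmentation before n-gram counting, which this sketch leaves out.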
Classification: TN912.3 [Electronics and Telecommunications: Communication and Information Systems]