Research on the Construction of Mongolian Speech Recognition Corpus Based on Speech-Text Alignment Technology

Original title: 语音文本对齐技术构建蒙古语语音识别语料库研究


Authors: ZHEN Zhaobo (甄兆博); ZHANG Hui (张晖) (National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, Hohhot 010020, China; Inner Mongolia Key Laboratory of Mongolian Information Processing Technology, Hohhot 010021, China; College of Computer Science, Inner Mongolia University, Hohhot 010021, China)

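Affiliations: [1] National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, Hohhot 010020, Inner Mongolia, China; [2] Inner Mongolia Key Laboratory of Mongolian Information Processing Technology, Hohhot 010020, Inner Mongolia, China; [3] College of Computer Science, Inner Mongolia University, Hohhot 010021, Inner Mongolia, China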

Source: Journal of Minzu University of China (Natural Sciences Edition) (《中央民族大学学报(自然科学版)》), 2024, No. 1, pp. 12-19 (8 pages)

Abstract: At present, the speech recognition data available for Mongolian lags far behind the training data for English and Chinese in scale, so a low-cost method of dataset construction is needed to make up for the shortage of data sources. Everyday communication has already produced a vast amount of Mongolian language resources, much of it in the form of speech loosely paired with text. This study adopts the technical route of refining such raw corpora into corpora usable for training, takes dubbed TV drama scripts and the corresponding finished episodes as a sample, and treats the refinement as a speech-text alignment problem. A series of automated processing steps converts the scripts and the corresponding audio into a form suitable for speech-text alignment, an iterative alignment method yields the alignment results, and these results are used to generate sentence-level aligned "speech-text pairs" for Mongolian speech recognition. A sampling check shows that the generated data is of good quality and basically consistent with manual annotation, which reduces the cost of data production.
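To make the idea of turning a loosely matched script into sentence-level speech-text pairs more concrete, the following is a minimal sketch of one common anchor-based approach. It is not the authors' method, which the abstract does not specify. It assumes a hypothetical recognizer has already produced word-level timestamps for the audio; the function name `sentence_spans`, the `min_coverage` threshold, and the placeholder tokens are all illustrative assumptions. Script words that also appear in the recognizer output act as anchors, and each sentence takes the time span covered by its anchored words.

```python
from difflib import SequenceMatcher


def sentence_spans(script_sentences, hyp_words, min_coverage=0.5):
    """Derive (sentence, start_sec, end_sec) pairs from a rough script.

    script_sentences: list of whitespace-tokenised script sentences.
    hyp_words: list of (word, start_sec, end_sec) tuples from a
        hypothetical recognizer (an assumption of this sketch).
    min_coverage: minimum fraction of a sentence's words that must be
        matched for the sentence to be kept (illustrative threshold).
    """
    # Flatten the script into words and remember which sentence owns each word.
    ref_words, owner = [], []
    for idx, sentence in enumerate(script_sentences):
        for word in sentence.split():
            ref_words.append(word)
            owner.append(idx)

    hyp_tokens = [word for word, _, _ in hyp_words]
    matcher = SequenceMatcher(a=ref_words, b=hyp_tokens, autojunk=False)

    spans = {}    # sentence index -> (earliest start, latest end)
    matched = {}  # sentence index -> number of anchored words
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            idx = owner[block.a + offset]
            _, start, end = hyp_words[block.b + offset]
            lo, hi = spans.get(idx, (start, end))
            spans[idx] = (min(lo, start), max(hi, end))
            matched[idx] = matched.get(idx, 0) + 1

    pairs = []
    for idx in sorted(spans):
        total = len(script_sentences[idx].split())
        # Keep only sentences with enough anchored words.
        if matched[idx] / total >= min_coverage:
            pairs.append((script_sentences[idx], *spans[idx]))
    return pairs


if __name__ == "__main__":
    # Placeholder tokens stand in for real script words.
    script = ["w1 w2 w3", "w4 w5 w6 w7"]
    hyp = [("w1", 0.0, 0.4), ("w2", 0.4, 0.9), ("x", 0.9, 1.2),
           ("w4", 1.5, 1.9), ("w5", 1.9, 2.3), ("w6", 2.3, 2.8),
           ("w7", 2.8, 3.1)]
    for sentence, start, end in sentence_spans(script, hyp):
        print(f"{start:.1f}-{end:.1f}s  {sentence}")
```

An iterative scheme of the kind the abstract mentions could, under the same assumptions, repeat this step on the regions between accepted anchors, for example after re-recognizing those regions, and accept additional sentences in each round.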

Keywords: speech recognition; Mongolian; raw corpus; speech-text alignment

Classification code: TN391 [Electronics and Telecommunications: Physical Electronics]

 
