Minimizing Speech Datasets Based on Word Coverage Rate

Authors: ZHU Zhijun (朱治军); FU Lei (付磊) (Qingshan District Branch of Wuhan Public Security Bureau (Gangcheng Branch), Wuhan 430080, China; Shenzhen Huawei Cloud Computing Technology Co., Ltd., Shenzhen 518129, China)

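Affiliations: [1] Qingshan District Branch (Gangcheng Branch), Wuhan Public Security Bureau, Wuhan, Hubei 430080; [2] Shenzhen Huawei Cloud Computing Technology Co., Ltd., Shenzhen, Guangdong 518129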

Source: Software Guide (《软件导刊》), 2024, No. 5, pp. 33-37 (5 pages)

Abstract: Training a high-performance automatic speech recognition (ASR) model is costly, both in collecting the training set and in the training itself. To address this, a speech training set minimization method based on word coverage is proposed, which reduces the amount of training data required as far as possible. The method introduces the concept of the vector space model: all corpus texts are mapped into a high-dimensional space, and the texts with the lowest similarity are selected by computing the cosine distance between their vectors. Audio is then collected for the selected texts, so that the best recognition performance is reached with as little audio data as possible. Finally, Hamming overlap is used to count the newly added vocabulary and so evaluate contribution, which refines the cosine-distance selection. Experiments show that, compared with random selection of the speech training set, the proposed method reaches the same word coverage with 21.31% less training data, and that the word coverage of a training set has a very strong positive correlation with the inference performance of the model trained on it. This shows that the collection and training costs of speech training sets can be reduced effectively while inference performance stays close, which in turn supports the further development of ASR technology.
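The selection procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the cosine-distance criterion and the new-vocabulary count are combined, so everything below is an assumption for illustration only: sentences are whitespace-tokenized (Chinese text would need a word segmenter), each sentence is a binary word-presence vector stored as a Python integer bitmask, candidates are ranked first by the number of not-yet-covered words and then by cosine distance to the already-covered vocabulary, and the names `select_min_subset`, `to_bits`, and `popcount` are hypothetical.

```python
import math


def build_index(corpus):
    """Map every distinct word in the corpus to a bit position."""
    vocab = sorted({w for s in corpus for w in s.lower().split()})
    return {w: i for i, w in enumerate(vocab)}


def to_bits(sentence, index):
    """Binary word-presence vector of a sentence, stored as an int bitmask."""
    bits = 0
    for w in sentence.lower().split():
        bits |= 1 << index[w]
    return bits


def popcount(bits):
    """Hamming weight: number of set bits in a non-negative int."""
    return bin(bits).count("1")


def cosine_distance(a, b):
    """Cosine distance between two binary vectors given as bitmasks."""
    if a == 0 or b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - popcount(a & b) / math.sqrt(popcount(a) * popcount(b))


def select_min_subset(corpus, target_coverage):
    """Greedily pick sentences until the target word coverage is reached.

    Ranking rule (an assumption, not taken from the paper): most newly
    covered words first, then largest cosine distance to the covered
    vocabulary, then lowest corpus position.
    """
    index = build_index(corpus)
    if not index:
        return [], 0.0
    vectors = [to_bits(s, index) for s in corpus]
    covered = 0
    chosen = []
    remaining = set(range(len(corpus)))
    while remaining and popcount(covered) / len(index) < target_coverage:
        best = max(
            remaining,
            key=lambda i: (
                popcount(vectors[i] & ~covered),        # newly added words
                cosine_distance(vectors[i], covered),   # dissimilarity
                -i,                                     # stable tie-break
            ),
        )
        if popcount(vectors[best] & ~covered) == 0:
            break  # no remaining sentence adds vocabulary
        chosen.append(best)
        covered |= vectors[best]
        remaining.remove(best)
    return [corpus[i] for i in chosen], popcount(covered) / len(index)


if __name__ == "__main__":
    texts = ["a b c", "a b", "c d", "d"]
    subset, coverage = select_min_subset(texts, target_coverage=1.0)
    print(subset, coverage)
```

In this toy corpus two of the four sentences already cover the whole vocabulary, which is the kind of saving the abstract reports against random selection. A real pipeline would then record audio only for the selected texts.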

Keywords: automatic speech recognition; vector space model; cosine distance; Hamming weight; training set minimization

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
