Low computational cost end-to-end speech recognition based on discrete wavelet transform and subband decoupling


Authors: TIAN Sanli; LI Ta; YE Lingxuan; WU Shisong; ZHAO Qingwei [1,2]; ZHANG Pengyuan

Affiliations: [1] Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190 [2] University of Chinese Academy of Sciences, Beijing 100049 [3] China Southern Power Grid Artificial Intelligence Technology Co., Ltd., Guangzhou 510000

Source: Acta Acustica (《声学学报》), 2025, No. 2, pp. 373-383 (11 pages)

Funding: Supported by the Science and Technology Innovation 2030 project (2022ZD0116103).

Abstract: To address the high computational cost of current end-to-end automatic speech recognition (E2E ASR) models, this paper proposes WLformer, a method that integrates the discrete wavelet transform (DWT) with E2E ASR and substantially reduces computing resource usage while also improving recognition performance. WLformer is built on the widely used Conformer model and introduces the proposed DWT Signal Compression Module, which compresses the model's intermediate-layer representations by removing their less informative high-frequency components, thereby lowering the model's resource usage. In addition, a sub-module structure named the DWT Subband Decoupling Feed-Forward Network (DSD-FFN) is proposed to replace some of the feed-forward networks in the original model, further reducing the computational cost. Experiments on three commonly used Chinese and English datasets, Aishell-1, HKUST, and LibriSpeech, show that, compared with Conformer, WLformer achieves a 47.4% relative reduction in GPU memory usage and a 39.2% relative reduction in computation (GFLOPs), together with an average relative character/word error rate reduction of 13.1%. WLformer also achieves better recognition performance than other mainstream E2E ASR models while using fewer computing resources, which further verifies the effectiveness of the proposed method.
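The compression idea described in the abstract (drop the high-frequency subband of an intermediate representation and keep the low-frequency one) can be illustrated with a minimal sketch. The abstract does not state which wavelet, decomposition level, or padding scheme WLformer uses, so the single-level Haar wavelet, the repeat-last-frame padding, and the function names `haar_dwt_time` and `compress` below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def haar_dwt_time(frames):
    """Single-level Haar DWT along the time axis (illustrative assumption).

    frames: list of T feature vectors, each a list of D floats.
    Returns (low, high): two lists of ceil(T / 2) vectors each.
    """
    if not frames:
        return [], []
    if len(frames) % 2 == 1:
        # Assumption: pad odd-length input by repeating the last frame.
        frames = frames + [frames[-1]]
    scale = 1.0 / math.sqrt(2.0)
    low, high = [], []
    for t in range(0, len(frames), 2):
        even, odd = frames[t], frames[t + 1]
        # Low-frequency (approximation) subband: scaled pairwise sum.
        low.append([(a + b) * scale for a, b in zip(even, odd)])
        # High-frequency (detail) subband: scaled pairwise difference.
        high.append([(a - b) * scale for a, b in zip(even, odd)])
    return low, high


def compress(frames):
    """Keep only the low-frequency subband, halving the sequence length."""
    low, _high = haar_dwt_time(frames)
    return low


if __name__ == "__main__":
    hidden = [[1.0, 2.0], [1.0, 2.0], [3.0, 0.0], [5.0, 0.0]]
    compressed = compress(hidden)
    print(len(hidden), "frames ->", len(compressed), "frames")
```

Under these assumptions, every later layer sees half as many frames. Because the cost of computing self-attention scores grows quadratically with sequence length, a shorter sequence is one plausible route to the kind of memory and GFLOPs savings the abstract reports, although the abstract itself does not break the savings down by component.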

Keywords: speech recognition; discrete wavelet transform; low computational resource usage; on-device deployment

Classification: TN912.34 [Electronics and Telecommunications: Communication and Information Systems]

 
