Author Affiliations: [1] Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; [2] School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China; [3] Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
Source: IEEE/CAA Journal of Automatica Sinica, 2019, Issue 5, pp. 1187-1195 (9 pages)
Funding: Partially supported by the National Natural Science Foundation of China (11590770-4, U1536117); the National Key Research and Development Program of China (2016YFB0801203, 2016YFB0801200); the Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1); and the Pre-research Project for Equipment of General Information System (JZX2017-0994/Y306)
Abstract: It is well known that automatic speech recognition (ASR) is a resource-consuming task. It takes a sufficient amount of data to train a state-of-the-art deep neural network acoustic model. For some low-resource languages where scripted speech is difficult to obtain, data sparsity is the main problem that limits the performance of speech recognition systems. In this paper, several knowledge transfer methods are investigated to overcome the data sparsity problem with the help of high-resource languages. The first is a pre-training and fine-tuning (PT/FT) method, in which the parameters of the hidden layers are initialized with a well-trained neural network. Second, progressive neural networks (Prognets) are investigated. With the help of lateral connections in the network architecture, Prognets are immune to the forgetting effect and superior in knowledge transfer. Finally, bottleneck features (BNF) are extracted using cross-lingual deep neural networks and serve as enhanced features to improve the performance of the ASR system. Experiments are conducted on a low-resource Vietnamese dataset. The results show that all three methods yield significant gains over the baseline system, and the Prognets acoustic model performs best. Further improvements can be obtained by combining the Prognets model and bottleneck features.
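Of the three methods the abstract describes, the progressive network is the most architecture-specific. The sketch below, assuming PyTorch, illustrates the lateral-connection idea only: a frozen source-language column feeds adapter connections into a trainable target-language column, so source knowledge is reused without being overwritten. The two-layer depth, the dimensions (440-dimensional input features, 1024 hidden units, 3000 senone targets), and all class names are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class SourceColumn(nn.Module):
    """Hypothetical source-language column; assumed already trained."""
    def __init__(self, in_dim=440, hidden=1024):
        super().__init__()
        self.h1 = nn.Linear(in_dim, hidden)
        self.h2 = nn.Linear(hidden, hidden)

class TargetColumn(nn.Module):
    """Target-language column with a lateral adapter from the source column."""
    def __init__(self, source, in_dim=440, hidden=1024, out_dim=3000):
        super().__init__()
        self.source = source
        for p in self.source.parameters():
            p.requires_grad = False          # freeze source weights: no forgetting
        self.h1 = nn.Linear(in_dim, hidden)
        self.h2 = nn.Linear(hidden, hidden)
        self.u2 = nn.Linear(hidden, hidden, bias=False)  # lateral adapter
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x):
        s1 = torch.relu(self.source.h1(x))   # frozen source activations
        t1 = torch.relu(self.h1(x))
        # the target layer combines its own input with a lateral
        # connection from the corresponding source layer below it
        t2 = torch.relu(self.h2(t1) + self.u2(s1))
        return self.out(t2)                  # senone posteriors for the ASR decoder

x = torch.randn(8, 440)                      # a batch of acoustic feature vectors
model = TargetColumn(SourceColumn())
print(model(x).shape)                        # torch.Size([8, 3000])
```

Because the source column is frozen, only the target layers and the adapter are updated during target-language training, which is what makes the architecture immune to the forgetting effect mentioned above.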
Keywords: bottleneck feature (BNF); cross-lingual automatic speech recognition (ASR); progressive neural networks (Prognets) model; transfer learning