Audiovisual speech recognition based on a deep convolutional neural network  


Authors: Shashidhar Rudregowda, Sudarshan Patilkulkarni, Vinayakumar Ravi, Gururaj H.L., Moez Krichen

Affiliations: [1] Department of Electronics and Communication Engineering, JSS Science and Technology University, Mysuru, 570006, India [2] Center for Artificial Intelligence, Prince Mohammad Bin Fahd University, Khobar, 34754, Saudi Arabia [3] Department of Information Technology, Manipal Institute of Technology Bengaluru, Manipal Academy of Higher Education, Manipal, 560064, India [4] Department of Information Technology, Faculty of Computer Science and Information Technology (FCSIT), Al-Baha University, Alaqiq, 65779-7738, Saudi Arabia [5] ReDCAD Laboratory, University of Sfax, Sfax, 3038, Tunisia

Source: Data Science and Management (数据科学与管理, English edition), 2024, Issue 1, pp. 25-34 (10 pages)

Abstract: Audiovisual speech recognition is an emerging research topic. Lipreading is the recognition of what someone is saying using visual information, primarily lip movements. In this study, we created a custom dataset for Indian English and organized the work into three parts: (1) audio recognition, (2) visual feature extraction, and (3) combined audio and visual recognition. Audio features were extracted using mel-frequency cepstral coefficients (MFCCs), and classification was performed using a one-dimensional convolutional neural network. Visual features were extracted using Dlib, and visual speech was then classified using a long short-term memory (LSTM) recurrent neural network. Finally, integration was performed using a deep convolutional network. Audio speech in Indian English was recognized with training and testing accuracies of 93.67% and 91.53%, respectively, after 200 epochs. For visual speech recognition on the Indian English dataset, the training accuracy was 77.48% and the test accuracy was 76.19% after 60 epochs. After integration, the training and testing accuracies of audiovisual speech recognition on the Indian English dataset were 94.67% and 91.75%, respectively.
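As a rough illustration of two building blocks named in the abstract, the sketch below shows the mel-scale mapping that underlies MFCC filterbanks and a single one-dimensional convolution with ReLU. It is not the authors' implementation: the function names are hypothetical, the mel formula is the common 2595·log10(1 + f/700) variant (the record does not say which variant was used), and the filter count, frequency range, and kernel values are placeholders.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (common 2595*log10 variant; an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Center frequencies (Hz) of n_filters bands equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + step * (i + 1)) for i in range(n_filters)]


def conv1d_valid(signal, kernel, bias=0.0):
    """One 1D CNN filter: 'valid' cross-correlation followed by ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        acc = bias
        for j in range(k):
            acc += signal[start + j] * kernel[j]
        out.append(max(0.0, acc))  # ReLU clips negative responses to zero
    return out


# Placeholder settings: 5 mel bands between 0 Hz and 8 kHz, and a toy
# feature sequence passed through a length-2 summing kernel.
centers = mel_center_frequencies(5, 0.0, 8000.0)
features = conv1d_valid([1.0, 2.0, 3.0, 4.0], [1.0, 1.0])
```

Bands spaced evenly in mels sit closer together at low frequencies in Hz, which is why MFCCs resolve the lower part of the speech spectrum more finely. A full 1D CNN would stack many such filters with learned kernels.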

Keywords: Audiovisual speech recognition; Custom dataset; 1D convolutional neural network (CNN); Deep CNN (DCNN); Long short-term memory (LSTM); Lipreading; Dlib; Mel-frequency cepstral coefficient (MFCC)

Classification code: TP3 [Automation and Computer Technology: Computer Science and Technology]

 
