STDNet: Improved lip reading via short-term temporal dependency modeling


Authors: Xiaoer WU, Zhenhua TAN, Ziwei CHENG, Yuran RU

Affiliations: [1] Software College, Northeastern University, Shenyang 110819, China; [2] Faculty of Software College, Northeastern University, Shenyang 110819, China

Source: Virtual Reality & Intelligent Hardware (《虚拟现实与智能硬件(中英文)》), 2025, No. 2, pp. 173-187 (15 pages)

Funding: Supported by the National Key Research and Development Program of China (2023YFC3306201); the National Natural Science Foundation of China (61772125); the Fundamental Research Funds for the Central Universities (N2317004).

Abstract: Background: Lip reading uses lip images for visual speech recognition. Deep-learning-based lip reading has greatly improved performance on current datasets; however, most existing research overlooks the short-term temporal dependencies of lip-shape variations between adjacent frames, which leaves room for further improvement in feature extraction. Methods: This article presents a spatiotemporal feature fusion network (STDNet) that compensates for the deficiencies of current lip-reading approaches in short-term temporal dependency modeling. Specifically, to distinguish similar and intricate content, STDNet adds a temporal feature extraction branch based on a 3D-CNN, which enhances the learning of dynamic lip movements in adjacent frames without affecting spatial feature extraction. In particular, we designed a local-temporal block, which aggregates interframe differences and strengthens the relationship between local lip regions through multiscale convolution. We incorporated the squeeze-and-excitation mechanism into the global-temporal block, which processes a single frame as an independent unit to learn temporal variations across the entire lip region more effectively. Furthermore, attention pooling was introduced to highlight meaningful frames containing key semantic information for the target word. Results: Experimental results demonstrated STDNet's superior performance on the LRW and LRW-1000 datasets, achieving word-level recognition accuracies of 90.2% and 53.56%, respectively. Extensive ablation experiments verified the rationality and effectiveness of its modules. Conclusions: The proposed model effectively addresses short-term temporal dependency limitations in lip reading and improves the temporal robustness of the model against variable-length sequences. These advancements validate the importance of explicit short-term dynamics modeling for practical lip-reading systems.
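Two ideas named in the abstract, interframe differences and attention pooling over frames, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `frame_differences` and `attention_pool`, the flat per-frame feature vectors, and the fixed `query` scoring vector are all assumptions made for illustration. In a trained network the scoring parameters would be learned and the features would come from convolutional layers.

```python
import math
from typing import List, Sequence

Vector = List[float]


def frame_differences(frames: Sequence[Sequence[float]]) -> List[Vector]:
    """Element-wise difference between each pair of adjacent frames.

    Hypothetical helper: each frame is a flat feature vector here.
    """
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


def attention_pool(features: Sequence[Sequence[float]],
                   query: Sequence[float]) -> Vector:
    """Softmax-weighted average of per-frame feature vectors.

    `query` is an assumed scoring vector; a real model would learn it.
    """
    # Score each frame by its dot product with the query vector.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over frames, one output value per feature dimension.
    dim = len(features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
```

With this sketch, frames whose difference vectors score highly against `query` dominate the pooled output, while static frames (zero difference) contribute little. That mirrors, in a simplified way, the idea of emphasizing frames that carry the most motion information.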

Keywords: Lip reading; Spatio-temporal feature fusion; Short-term temporal dependency modeling

Classification code: O17 [Science: Mathematics]

 
