Audio-visual keyword transformer for unconstrained sentence-level keyword spotting  


Authors: Yidi Li, Jiale Ren, Yawei Wang, Guoquan Wang, Xia Li, Hong Liu

Affiliations: [1] Key Laboratory of Machine Perception, Peking University, Shenzhen Graduate School, Shenzhen, China; [2] College of Electronics and Information Engineering, Sichuan University, Chengdu, China; [3] Department of Computer Science, ETH Zurich, Zurich, Switzerland

Source: CAAI Transactions on Intelligence Technology, 2024, Issue 1, pp. 142-152 (11 pages). Chinese journal title: 智能技术学报 (English edition)

Funding: Science and Technology Plan of Shenzhen, Grant/Award Number: JCYJ20200109140410340; National Natural Science Foundation of China, Grant/Award Number: 62073004.

Abstract: As one of the most effective methods to improve the accuracy and robustness of speech tasks, the audio-visual fusion approach has recently been introduced into the field of Keyword Spotting (KWS). However, existing audio-visual keyword spotting models are limited to detecting isolated words, while keyword spotting for unconstrained speech is still a challenging problem. To this end, an Audio-Visual Keyword Transformer (AVKT) network is proposed to spot keywords in unconstrained video clips. The authors present a transformer classifier with learnable CLS tokens to extract distinctive keyword features from the variable-length audio and visual inputs. The outputs of the audio and visual branches are combined in a decision fusion module. Just as humans can easily notice whether a keyword appears in a sentence, the AVKT network can detect whether a video clip with a spoken sentence contains a pre-specified keyword. Moreover, the position of the keyword is localised in the attention map without additional position labels. Experimental results on the LRS2-KWS dataset and the newly collected PKU-KWS dataset show that the accuracy of AVKT exceeded 99% in clean scenes and 85% in extremely noisy conditions. The code is available at https://github.com/jialeren/AVKT.
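
The abstract names two mechanisms without giving their details: a CLS token whose attention map localises the keyword, and a decision fusion module that combines the two branches. The following is a minimal sketch of those two ideas only, not the authors' implementation (see the repository linked above for that). Everything here is an assumption made for illustration: the function names `cls_attention` and `fuse_decisions`, the single-head scaled dot-product attention, the weighted-average fusion rule, the 0.5 decision threshold, and the toy feature values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention(cls_query, frames):
    """Attention weights of one CLS query over a variable-length frame sequence.

    Hypothetical single-head scaled dot-product attention; the real model
    may use several heads and layers.
    """
    scale = math.sqrt(len(cls_query))
    scores = [
        sum(q * f for q, f in zip(cls_query, frame)) / scale
        for frame in frames
    ]
    return softmax(scores)


def fuse_decisions(p_audio, p_visual, audio_weight=0.5):
    """Weighted average of per-branch keyword probabilities.

    Assumed fusion rule; the abstract does not state which rule is used.
    """
    return audio_weight * p_audio + (1.0 - audio_weight) * p_visual


# Toy input: four frames with two-dimensional features (made-up values).
cls_query = [1.0, 0.0]
frames = [[0.1, 0.3], [2.0, 0.1], [0.2, 0.4], [0.0, 0.5]]

# The frame that receives the most attention is a rough keyword position.
weights = cls_attention(cls_query, frames)
peak_frame = max(range(len(weights)), key=weights.__getitem__)

# Combine made-up branch probabilities and threshold the result.
fused = fuse_decisions(p_audio=0.9, p_visual=0.7, audio_weight=0.6)
contains_keyword = fused >= 0.5
```

In this sketch the attention weights serve two purposes: they pool a variable-length sequence into one fixed-size decision, and their peak gives a position estimate without any position labels, which mirrors the localisation claim in the abstract. Lowering `audio_weight` when the audio is noisy is one plausible way such a fusion step could keep accuracy up, though that choice is also an assumption here.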

Keywords: artificial intelligence; multimodal approaches; natural language processing; neural network; speech processing

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
