Chinese Sign Language Recognition Based on Multimodal Video Captioning (基于多模态视频描述的中国手语识别)

Authors: YUAN Tian-tian (袁甜甜) [1]; YANG Xue (杨学) [1]

Affiliation: [1] Technical College for the Deaf, Tianjin University of Technology, Tianjin 300384, China

Source: Journal of Shandong Agricultural University (Natural Science Edition), 2021, No. 1, pp. 143-148 (6 pages)

Funding: Tianjin Special Fund Project for the Development of Industrial Enterprises (201807111).

Abstract: Computer vision is an important direction in the development of China's new generation of artificial intelligence technology. Sign language recognition is difficult because of the continuity of signing and interference from complex scenes; research on it can therefore not only meet the real need of hearing-impaired people for barrier-free communication, but also greatly advance video understanding and analysis, with practical applications in areas such as security and intelligent surveillance. By comparing a range of gesture recognition methods from China and abroad that are based on video captioning and analysis, this paper presents a strategy analysis for video sign language recognition and for deep-learning-based video captioning. It compares methods that use raw video frames, video optical flow, and current state-of-the-art pose estimation, and then proposes a multimodal captioning strategy, a training model architecture, and a spatio-temporal attention model suited to Chinese sign language video data. Using a video captioning and training method assisted by depth information, experiments show a BLEU-4 score of up to 52.3, about 20% higher than the basic method used previously. However, because the depth information this method relies on is not easy to obtain in real-world settings, captioning and recognition methods for ordinary RGB video captured by mobile phone or computer cameras are the direction for future work.

Keywords: sign language recognition; video captioning; multimodal

Classification: TP387 [Automation and Computer Technology: Computer System Architecture]
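
Note on the metric: the abstract reports its result as a BLEU-4 score. For readers unfamiliar with it, the following is a minimal sketch of how a corpus-level BLEU-4 score is commonly computed (clipped 1- to 4-gram precision, geometric mean, brevity penalty). It is not the authors' evaluation code. The function names, the single reference per hypothesis, the absence of smoothing, the 0-100 scaling, and the example gloss sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(references, hypotheses):
    """Corpus-level BLEU-4 on a 0-100 scale.

    Assumes exactly one reference per hypothesis and applies no smoothing
    (illustrative choices, not taken from the paper).
    """
    matches = [0] * 4  # clipped n-gram matches for n = 1..4
    totals = [0] * 4   # hypothesis n-gram counts for n = 1..4
    ref_len = hyp_len = 0

    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, 5):
            ref_counts = ngram_counts(ref, n)
            hyp_counts = ngram_counts(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if min(matches) == 0:
        return 0.0

    # Geometric mean of the four modified precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Hypothetical sign-gloss sequences, invented for this example.
    refs = [["I", "today", "go", "school", "study"]]
    hyps = [["I", "today", "go", "school"]]
    print(round(corpus_bleu4(refs, hyps), 2))
```

In this example every n-gram in the hypothesis also occurs in the reference, so the score is lowered only by the brevity penalty for the missing final token.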
