Source: Journal of Image and Graphics (《中国图象图形学报》), 2024, Issue 5, pp. 1392-1407 (16 pages)
Funding: National Natural Science Foundation of China (61572162, 61802095); Zhejiang Provincial Key R&D Program "Lingyan" Project (2023C01145); Zhejiang Provincial Natural Science Foundation (LQ17F020003).
Abstract: Objective: Action recognition is becoming increasingly important in industrial manufacturing. In complex production workshops, however, it is hindered by occlusion, viewpoint changes, and the difficulty of distinguishing similar actions. To address these problems, this paper proposes a packing behavior recognition method based on a dual-view skeleton multi-stream network. Method: Stacked residual frames (RF, difference images) are used as model input, and a multi-view module mitigates occlusion of the human body. In the view transformation module, the differential human skeleton is rotated to the optimal virtual observation angle, and the transformed skeleton data are fed into a three-layer stacked long short-term memory (LSTM) network; the classification scores from the different views are fused to obtain the recognition result. To recognize subtle actions, a locally localized image convolutional network combined with an attention mechanism is used, and the localized regions are passed to a convolutional neural network for recognition. The skeleton-based and local-image-based results are then fused to predict the worker's action. Results: Experiments in a real packing scenario on the production floor achieved a packing behavior recognition accuracy of 92.31%, substantially ahead of existing mainstream action recognition methods. The method was also evaluated on the public NTU (Nanyang Technological University) RGB+D dataset, reaching 85.52% under the CS (cross-subject) protocol and 93.64% under the CV (cross-view) protocol, outperforming other networks and further verifying the method's effectiveness and accuracy. Conclusion: The proposed human action recognition method fully exploits action information from multiple views and combines a skeleton network with a convolutional neural network, effectively improving recognition accuracy.

Objective: Action recognition has become increasingly important in industrial manufacturing. Production efficiency and quality can be improved by recognizing worker actions and postures in complex production environments. In recent years, action recognition based on skeletal data has received widespread attention and research, with methods mainly based on graph convolutional networks (GCN) or long short-term memory (LSTM) networks exhibiting excellent recognition performance in experiments. However, these methods have not considered the recognition problems of occlusion, viewpoint changes, and similar subtle actions in the factory environment, which may have a significant impact on subsequent action recognition. Therefore, this study proposes a packing behavior recognition method that combines a dual-view skeleton multi-stream network. Method: The network model consists of a main network and a sub-network. The main network takes as input two RGB videos from different viewpoints that record the same worker action simultaneously. Subsequently, the image difference method is used to convert the input video data into difference images. Moreover, the 3D skeleton information of the person is extracted from the depth map by a 3D pose estimation algorithm and then passed to the subsequent view transformation module.
In the view transformation module, rotation of the skeleton data is used to find the best viewing angle, and the transformed skeleton data are passed into a three-layer stacked LSTM network. The classification scores of the different views are combined by weighted fusion to obtain the recognition result of the main network. In addition, for some similar behaviors and non-compliant "fake actions", we use a local positioning image convolutional network combined with an attention mechanism and pass the localized regions into the ResNeXt network for recognition. Moreover, we introduce a spatio-temporal attention mechanism for analyzing video action recognition sequences, focusing on the key frames of the skeleton sequence.
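The pipeline described above (frame differencing to obtain residual frames, rotating the 3D skeleton to a virtual viewpoint, and weighted fusion of per-view classification scores) can be illustrated with a minimal NumPy sketch. The function names, the yaw-only rotation, and the fixed fusion weight are illustrative assumptions, not the paper's actual implementation, which additionally searches for the best observation angle and uses learned fusion weights.

```python
import numpy as np

def residual_frames(frames):
    """Residual frames (RF): absolute differences of consecutive frames.

    frames: array of shape (T, H, W) or (T, H, W, C); returns (T-1, ...).
    """
    frames = np.asarray(frames, dtype=np.float32)
    return np.abs(np.diff(frames, axis=0))

def rotate_skeleton(joints, yaw):
    """Rotate (J, 3) joint coordinates about the vertical (z) axis.

    A stand-in for the view transformation module; the paper searches for
    the best virtual observation angle, while here it is supplied directly.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    rz = np.array([[c, -s, 0.0],
                   [s,  c, 0.0],
                   [0.0, 0.0, 1.0]], dtype=np.float32)
    return np.asarray(joints, dtype=np.float32) @ rz.T

def fuse_view_scores(scores_a, scores_b, w=0.5):
    """Weighted fusion of the two views' class scores (w is a placeholder)."""
    fused = w * np.asarray(scores_a) + (1.0 - w) * np.asarray(scores_b)
    return int(np.argmax(fused)), fused
```

For example, fusing view scores [0.2, 0.8] and [0.6, 0.4] with equal weights gives fused scores [0.4, 0.6], so class 1 is predicted.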
Keywords: action recognition; long short-term memory (LSTM) network; dual view; adaptive view transformation; attention mechanism
Classification code: TP399 [Automation and Computer Technology / Computer Application Technology]