Action recognition based on two-stream spatio-temporal residual attention network

Authors

  • Jingjing Han
  • Hua Huo

DOI:

https://doi.org/10.54097/jceim.v10i2.7876

Keywords:

Action recognition, Attention mechanism, Two-stream network, Feature fusion

Abstract

Human action recognition faces several challenges: video action samples are scarce, models tend to overfit during training, network models extract key features poorly, and recognition accuracy is low. To address these problems, we propose an action-recognition method based on a two-stream spatio-temporal residual attention network. First, we add the Convolutional Block Attention Module (CBAM) to strengthen the model's channel-wise and spatial feature extraction from video and to suppress interfering information. We then train two convolutional neural networks, one per stream, to model static spatial information and motion information respectively, and combine the outputs of the two networks by weighted fusion to generate the final action-recognition result. Extensive experiments on the UCF101 and HMDB51 datasets achieve recognition accuracies of 92.9% and 73.66%, respectively. The results show that the model makes full use of the spatio-temporal information in video and extracts the key information of motion.
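The weighted late fusion described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class logits, the number of classes, and the stream weights (0.4 spatial, 0.6 temporal) are all hypothetical values chosen for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def weighted_fusion(spatial_logits, temporal_logits,
                    w_spatial=0.4, w_temporal=0.6):
    """Fuse the two streams' per-class scores by a weighted sum of
    their softmax outputs; return the fused scores and the predicted
    class index. Weights here are illustrative, not from the paper."""
    scores = (w_spatial * softmax(spatial_logits)
              + w_temporal * softmax(temporal_logits))
    return scores, int(np.argmax(scores))

# Hypothetical logits for a 5-class problem: the spatial stream favors
# class 0, the temporal stream favors class 1.
spatial = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
temporal = np.array([0.3, 2.5, 0.2, -0.5, 0.1])
fused, pred = weighted_fusion(spatial, temporal)
```

With the temporal stream weighted more heavily, the fused prediction follows the motion cue (class 1) even though the spatial stream alone would pick class 0.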


Published

23-04-2023


How to Cite

Han, J., & Huo, H. (2023). Action recognition based on two-stream spatio-temporal residual attention network. Journal of Computing and Electronic Information Management, 10(2), 45-51. https://doi.org/10.54097/jceim.v10i2.7876