Lightweight Dynamic Gesture Recognition Based on ShuffleNetV2-Mamba Hybrid Architecture

Jiaxuan Chai; Mingge Sun; Dongxuan Huang; Sen Ye

doi:10.54097/7ms8ar63

Keywords:

Dynamic Gesture Recognition, Lightweight Model, ShuffleNetV2, Mamba, Spatio-Temporal Feature Fusion

Abstract

Dynamic gesture recognition is valuable for human-computer interaction on mobile terminals, but existing methods generally suffer from high computational complexity and insufficient temporal modeling ability. This paper therefore proposes a lightweight dynamic gesture recognition model based on a ShuffleNetV2-Mamba (Shuma) hybrid architecture. The model embeds Mamba's state space sequence modeling module into the ShuffleNetV2 backbone to achieve efficient spatio-temporal feature fusion. First, part of the convolution operations in the downsampling bottleneck layers of ShuffleNetV2 are replaced with Mamba modules, whose linear complexity is used to capture long-range dependencies between video frames. Second, a multi-scale dynamic feature fusion mechanism is designed that combines channel shuffle with cross-layer feature concatenation to strengthen the joint representation of local details and global motion patterns in continuous gestures. To further improve deployment efficiency, layered quantization and structured pruning are introduced, compressing the model to 2.1 MB. Experiments on a specific dynamic gesture dataset covering first-person and home-monitoring scenes show a gesture classification accuracy of 89.7%, with computational overhead about 43.6% lower than that of traditional 3D-CNN and CNN-LSTM models. This study provides an efficient solution for real-time dynamic gesture interaction in resource-constrained scenarios and verifies the effectiveness of fusing lightweight convolution with a sequential state space model.
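The two building blocks named in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: `channel_shuffle` follows the usual ShuffleNet group-interleaving idea on a flat list of channel indices, and `ssm_scan` is a deliberately simplified scalar state space recurrence with fixed parameters `a`, `b`, `c` (hypothetical names), whereas Mamba makes these parameters input-dependent. The point is only to show that the recurrence visits each frame once, so its cost grows linearly with the number of frames.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (ShuffleNet-style shuffle).

    Conceptually: reshape to (groups, per_group), transpose, flatten.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


def ssm_scan(xs, a, b, c):
    """Simplified linear state space recurrence over a frame sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    Assumption: a, b, c are fixed scalars; Mamba's selective variant
    computes them from the input at each step.
    """
    h = 0.0
    ys = []
    for x in xs:  # one pass over the frames: O(T) time
        h = a * h + b * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    # Six channels in two groups are interleaved across the groups.
    print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))
    # An impulse at the first frame decays through the hidden state.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
```

In this sketch the hidden state carries information from earlier frames forward, which is the mechanism the abstract relies on for long-range dependencies, while the shuffle lets information mix between channel groups.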
