Self-Supervised Learning for Speech-Based Detection of Depressive States

Authors

  • Xinlin Li
  • Changhe Fan
  • Chengyue Su

DOI:

https://doi.org/10.54097/1cspmj65

Keywords:

Self-supervised Pre-training, CNN-BiLSTM, Depression Identification, Speech Detection

Abstract

This study aims to improve the accuracy of depression detection through representation learning on audio data. Depression speech datasets are small and costly to annotate, so a self-supervised pre-training approach is employed to improve the performance, generalization, and training efficiency of downstream tasks. When the unlabeled data contain substantial noise or errors, however, the audio representations learned by self-supervised pre-training can be degraded. The model therefore needs to analyze long-range sequence data effectively in order to resist such interference, yet traditional LSTM models are limited in context extraction and in robustness to input outliers. To address this, an improved method named CNN-BiLSTM is proposed. The network initializes the LSTM's embedding layer with pre-trained word vectors and extracts spatial and temporal features separately so that the useful information in the input is fully expressed. The features are then weighted according to their importance and combined into fused features. A random forest, which performs well on high-dimensional data, is used for classification to reduce the risk of overfitting. Experimental results show that the proposed model achieves good classification performance on the depression dataset, outperforming traditional methods and state-of-the-art approaches.
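The abstract does not state how the fusion weights are computed. As a minimal sketch only, the importance-weighted fusion of spatial and temporal features could look like the following, assuming the two feature vectors have equal length and the weights come from a softmax over two importance scores. The function names, the softmax choice, and the toy values are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw importance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(spatial, temporal, importance_scores):
    """Weighted sum of two equal-length feature vectors.

    importance_scores is a hypothetical pair (spatial_score, temporal_score).
    In a trained network these would be learned; here they are set by hand.
    """
    if len(spatial) != len(temporal):
        raise ValueError("feature vectors must have the same length")
    w_spatial, w_temporal = softmax(importance_scores)
    return [w_spatial * s + w_temporal * t for s, t in zip(spatial, temporal)]


# Toy vectors standing in for CNN (spatial) and BiLSTM (temporal) outputs.
spatial_features = [0.2, 0.8, 0.5]
temporal_features = [0.6, 0.4, 0.1]

# Equal scores give equal weights, so the result is the element-wise mean.
fused = fuse_features(spatial_features, temporal_features, [0.0, 0.0])
```

In a full pipeline, a fused vector of this kind would be the input to the random forest classifier. Raising one importance score shifts the fused vector toward that branch's features.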

Published

27-02-2025

Issue

Section

Articles

How to Cite

Li, X., Fan, C., & Su, C. (2025). Self-Supervised Learning for Speech-Based Detection of Depressive States. Frontiers in Computing and Intelligent Systems, 11(2), 106-109. https://doi.org/10.54097/1cspmj65