disentanglement using pre-trained features
DOI:
https://doi.org/10.54097/tk7x3833
Keywords:
Pre-trained, depression detection, speech classification, speaker information.
Abstract
This article proposes using pre-trained features to address the challenge of detecting depression from speech. Traditional approaches based on raw audio have shown low accuracy and insufficient generalization in depression detection. We use existing pre-trained models to extract general feature representations from speech data. During the pre-training process, we further decouple speaker information, which introduces prior information and provides a better starting point for training downstream models. The results indicate that the best performance was achieved with features extracted from the ContentVec pre-trained model with the speaker-decoupling improvement.
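The downstream stage described in the abstract (features from a pre-trained model fed to a classifier) can be illustrated with a minimal sketch. It assumes frame-level feature vectors have already been extracted by some pre-trained speech encoder. The function names `mean_pool`, `fit_centroids`, and `predict`, the nearest-centroid classifier, the label strings, and all numbers are hypothetical placeholders chosen for illustration; they are not the models, data, or results of the article, and speaker decoupling is not modelled here.

```python
from math import dist
from statistics import fmean


def mean_pool(frames):
    """Average a T x D list of frame vectors into one D-dimensional vector."""
    if not frames:
        raise ValueError("need at least one frame")
    return [fmean(column) for column in zip(*frames)]


def fit_centroids(embeddings, labels):
    """Compute one mean embedding (centroid) per class label."""
    grouped = {}
    for embedding, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(embedding)
    return {label: mean_pool(vectors) for label, vectors in grouped.items()}


def predict(centroids, embedding):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], embedding))


# Toy frame-level features: made-up numbers, 2 dimensions, 2 frames each.
# Real encoder outputs would have hundreds of dimensions and many frames.
train_frames = [
    [[0.0, 1.0], [0.2, 0.8]],  # utterance labelled "control"
    [[0.1, 0.9], [0.1, 1.1]],  # utterance labelled "control"
    [[1.0, 0.0], [0.8, 0.2]],  # utterance labelled "depressed"
    [[0.9, 0.1], [1.1, 0.1]],  # utterance labelled "depressed"
]
train_labels = ["control", "control", "depressed", "depressed"]

# The encoder is treated as frozen: only pooling and the classifier are fit.
train_embeddings = [mean_pool(frames) for frames in train_frames]
centroids = fit_centroids(train_embeddings, train_labels)

test_embedding = mean_pool([[0.9, 0.2], [1.0, 0.0]])
print(predict(centroids, test_embedding))
```

Mean pooling is only one way to turn frame-level features into an utterance-level vector; a trained downstream network could consume the frame sequence directly.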
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.