Cascaded Convolutional Neural Network Architecture for Speech Emotion Recognition in Noisy Conditions

Chung-Ang University Artificial Intelligence Humanities Research Institute

HK+ Artificial Intelligence Humanities


Title	Cascaded Convolutional Neural Network Architecture for Speech Emotion Recognition in Noisy Conditions	2024-04-13 16:28
Posted by	aihumanities
by Youngja Nam 1 and Chankyu Lee 1,2,*

1 Humanities Research Institute, Chung-Ang University, Seoul 06974, Korea
2 Department of Korean Language and Literature, Chung-Ang University, Seoul 06974, Korea
* Author to whom correspondence should be addressed.

Sensors 2021, 21(13), 4399; https://doi.org/10.3390/s21134399
Submission received: 25 May 2021 / Revised: 20 June 2021 / Accepted: 24 June 2021 / Published: 27 June 2021
(This article belongs to the Section Sensor Networks)

Abstract

Convolutional neural networks (CNNs) are a state-of-the-art technique for speech emotion recognition. However, CNNs have mostly been applied to noise-free emotional speech data, and limited evidence is available for their applicability in emotional speech denoising. In this study, a cascaded denoising CNN (DnCNN)–CNN architecture is proposed to classify emotions from Korean and German speech in noisy conditions. The proposed architecture consists of two stages. In the first stage, the DnCNN exploits the concept of residual learning to perform denoising; in the second stage, the CNN performs the classification. The classification results for real datasets show that the DnCNN–CNN outperforms the baseline CNN in overall accuracy for both languages. For Korean speech, the DnCNN–CNN achieves an accuracy of 95.8%, whereas the accuracy of the CNN is marginally lower (93.6%). For German speech, the DnCNN–CNN has an overall accuracy of 59.3–76.6%, whereas the CNN has an overall accuracy of 39.4–58.1%. These results demonstrate the feasibility of applying the DnCNN with residual learning to speech denoising and the effectiveness of the CNN-based approach in speech emotion recognition. Our findings provide new insights into speech emotion recognition in adverse conditions and have implications for language-universal speech emotion recognition.

Keywords: cascaded DnCNN–CNN; speech emotion recognition; residual learning
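
The two-stage data flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no layer counts, input features, or class labels, so the 40 x 100 spectrogram-like input, the four emotion classes, the single fixed 3 x 3 filter standing in for a trained DnCNN, and the random linear classifier standing in for the trained CNN are all assumptions made for illustration. The only point it tries to show is the residual-learning step: stage 1 predicts the noise, and the clean estimate is the noisy input minus that prediction, which is then passed to stage 2.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Naive same-size 2-D filtering of a single-channel array (odd kernel sizes only)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out


def predict_residual(noisy, kernel):
    """Stage 1 stand-in (assumption): one filter that outputs estimated noise, not clean speech."""
    return conv2d_same(noisy, kernel)


def denoise(noisy, kernel):
    """Residual learning: clean estimate = noisy input - predicted noise."""
    return noisy - predict_residual(noisy, kernel)


def classify(features, weights):
    """Stage 2 stand-in (assumption): average over time, linear layer, softmax over classes."""
    pooled = features.mean(axis=1)            # shape: (n_freq,)
    logits = weights @ pooled                 # shape: (n_classes,)
    exp = np.exp(logits - logits.max())       # subtract max for numerical stability
    return exp / exp.sum()


def cascade(noisy, kernel, weights):
    """Cascade: denoise first, then classify the denoised features."""
    return classify(denoise(noisy, kernel), weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical smooth "clean" spectrogram (40 bins x 100 frames) plus additive noise.
    clean = np.outer(np.linspace(0.0, 1.0, 40), np.ones(100))
    noisy = clean + 0.3 * rng.standard_normal(clean.shape)

    # Untrained high-pass filter (identity minus 3x3 box blur) used as the noise estimator.
    kernel = -np.ones((3, 3)) / 9.0
    kernel[1, 1] += 1.0

    # Random weights for four hypothetical emotion classes.
    weights = rng.standard_normal((4, 40))

    denoised = denoise(noisy, kernel)
    print("MSE noisy   :", np.mean((noisy - clean) ** 2))
    print("MSE denoised:", np.mean((denoised - clean) ** 2))
    print("class probabilities:", cascade(noisy, kernel, weights))
```

In a trained system the filter and the classifier weights would be learned from data. The hand-set filter here only makes the subtraction step visible, and the printed probabilities carry no meaning because the classifier weights are random.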
