Abstract:
The speech signal is a one-dimensional function of time that originates from the speaker's mouth, nose, and cheeks. Lombard speech, in contrast, is speech produced under the influence of background noise: speakers involuntarily alter the way they speak to improve intelligibility in a noisy environment. The Lombard effect that arises from this adaptation is expected to significantly degrade the performance of automatic speech recognition systems that were not designed to account for it. In this study, we compare the emotions expressed in speech under normal conditions with those expressed in Lombard speech and examine how the two differ. Because little prior work has considered using machine learning or deep learning to account for the Lombard effect in speech emotion recognition, we study how the Lombard effect impacts speech emotion recognition systems built on different machine learning and deep learning models. Such a study can help services adapt to the emotional state of their customers. In this paper, we review and compare a number of speech emotion recognition algorithms on both normal audio recordings and recordings with an induced Lombard effect. The speech dataset, comprising audio from multiple speakers, was created and populated by the authors, and audio features were extracted using the Mel-Frequency Cepstral Coefficient (MFCC) approach. The machine learning models were trained on a speech dataset recorded in a laboratory and then tested on a separate dataset of Lombard speech. A Convolutional Neural Network was also trained as part of the experiment.
As the results show, the Convolutional Neural Network model achieves higher accuracy for speech emotion recognition than the traditional machine learning models. The accuracies achieved on Lombard speech data are also found to be much lower than those obtained on normal speech data. This may be because, when speakers perceive external noise, they tend to speak louder and convey the same message with more emotion than they would in a typical, undisturbed setting.
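A minimal sketch of the kind of convolutional classifier the experiment describes, operating on fixed-length MFCC feature vectors. The framework (PyTorch), the layer sizes, and the seven-emotion output are hypothetical choices for illustration; the study does not detail its architecture.

```python
import torch
import torch.nn as nn


class SERConvNet(nn.Module):
    """Hypothetical 1-D CNN for speech emotion recognition over MFCC vectors."""

    def __init__(self, n_mfcc=40, n_emotions=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # two pooling stages halve the length twice: n_mfcc // 4 positions remain
        self.classifier = nn.Linear(128 * (n_mfcc // 4), n_emotions)

    def forward(self, x):
        x = x.unsqueeze(1)        # (batch, n_mfcc) -> (batch, 1, n_mfcc)
        x = self.features(x)      # -> (batch, 128, n_mfcc // 4)
        return self.classifier(x.flatten(1))  # emotion logits
```

Training such a model on normal-speech MFCC vectors and evaluating it on Lombard-speech vectors would reproduce the train/test mismatch the study investigates.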