Abstract:
This research presents the segmentation of single-syllable sounds for speech recognition using an artificial neural
network that combines key features extracted from speech signals in the time and frequency domains. The approach first
divides the speech signal into frames using the short-time energy waveform. Pitch markers are then extracted from each
frame and used as reference points to split the frame into sections. Each section is analyzed with a window search to
identify the positions and amplitudes of local minima and maxima, as well as the maximum slope values, which serve as
the key features in the time domain. In the frequency domain, Mel-frequency cepstral coefficients (MFCCs) serve as
additional key features. The two sets of key features are combined and fed into the artificial neural network for
recognition. The study also compares recognition performance when the time-domain and frequency-domain features are
fed into the network separately versus combined. The results show that an artificial neural network with two input
layers (MFCC and time-domain features) and shared hidden layers achieves the highest recognition accuracy: 96.97%,
and 88.43% on blind tests.
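
To make the two-input architecture described above more concrete, the following is a minimal sketch, assuming a Keras-style dense network; the feature dimensions, layer sizes, and number of syllable classes are illustrative assumptions and do not reflect the authors' actual configuration.

```python
# Sketch (not the authors' implementation) of a network with two input layers,
# one for MFCC features and one for time-domain features, joined by shared
# hidden layers. All sizes below are assumed for illustration only.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_MFCC = 13        # assumed number of cepstral coefficients
N_TIME_FEATS = 8   # assumed number of time-domain features per section
N_CLASSES = 10     # assumed number of single-syllable classes

# Two separate input layers, one per feature domain.
mfcc_in = keras.Input(shape=(N_MFCC,), name="mfcc_features")
time_in = keras.Input(shape=(N_TIME_FEATS,), name="time_domain_features")

# Concatenate the two feature vectors and pass them through shared hidden layers.
x = layers.Concatenate()([mfcc_in, time_in])
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(N_CLASSES, activation="softmax")(x)

model = keras.Model(inputs=[mfcc_in, time_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data illustrating the expected input shapes.
X_mfcc = np.random.rand(100, N_MFCC).astype("float32")
X_time = np.random.rand(100, N_TIME_FEATS).astype("float32")
y = np.random.randint(0, N_CLASSES, size=(100,))
model.fit([X_mfcc, X_time], y, epochs=2, batch_size=16, verbose=0)
```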